Understanding User Experience and Acceptance Statistics

I am still busy with my literature review of the mobile learning research published over the past five years. One thing that strikes me after scanning more than 200 papers is the large number of user acceptance and user experience studies that rely on simple questionnaires with Likert scales.

While some of the studies use questionnaires based on standardised models such as UTAUT, the vast majority use hand-crafted questionnaires without an underpinning model. Without such a model it is hard to understand the real meaning of the results, because in most cases the items asked are unique to the study. So the critical question is: how do we interpret the results of these studies in order to gain a better understanding of our domain?

Let's briefly analyse the anatomy of these studies: often the tool is relatively new or a prototype. This means that not all functions are perfectly implemented, so some participants will be negatively influenced by these drawbacks. In the context of technology-enhanced learning we are confronted with another effect: if the target audience is recruited among students, we can observe a novelty effect when the students start using a new service that supports learning. Another effect in user acceptance and user experience studies with explicit prototypes is that the participants are polite. Finally, we need to consider the nature of the Likert scale itself, which influences how we read the results.

The Novelty Effect

The novelty effect is commonly observed in long-term studies: an initial excitement over a new technology quickly wears off after the first phase. The effect can be so extreme that users stop using the prototype system once it fades. More commonly, the participants integrate the tool into their normal activities, which changes the ways in which, and how frequently, they use it.

The important aspect of the novelty effect for user acceptance and user experience analyses is that it creates a positive bias in the results that later disappears.

My own studies indicate that the novelty effect wears off after one or two weeks of intense system use. Other studies report similar timeframes, with extremes of up to four weeks. The problem with the novelty effect is that it can only be unveiled by time series. In summative studies the effect may or may not be present; we simply cannot tell.

The impact of this effect depends on the students' interest in ICT. For example, the novelty effect is stronger but wears off faster with computer science students, while students with a humanities background appear to show the effect to a lesser extent but for longer.

It is therefore very important to analyse the conditions of a study in order to estimate the presence of the novelty effect. For example, if a paper reports on experiments or short case studies that last no longer than four weeks, one can expect the novelty effect to be prominently present.

Polite Participants

The second effect that influences the results is the politeness of the participants. This effect, too, creates a positive bias in the participants' responses. I observed it in training sessions for software prototypes in which the participants clearly indicated that they disliked the tool. However, in the evaluation forms collected immediately after the training, I did not find any negative responses. It seems that if participants are aware of the prototype status of a tool, they are more polite in their formal responses than in their off-the-record remarks.

In one project I had the opportunity to run an evaluation at different stages of the tool's development with the same participants. During the early prototype phase the feedback was not overly excited, but not bad either. In the evaluation at the end of the project, almost two years later, the responses in the evaluation forms were much more critical. These evaluations were not affected by the novelty effect, because the participants had no access to the software between the two evaluation events. However, it seemed that the participants were aware that the second iteration of the evaluation concerned the almost final version of the tool, while the early development status was taken into account during the first evaluation. The participants apparently anticipated some headroom for improvement in the first round, as it was clearly visible that they were dealing with a prototype. This headroom was not granted in the second round, where the tool looked much more mature and worked more smoothly.

Many studies emphasize the novelty of their ICT tools and the lack of prior evaluation. This is a good indicator that we can also expect the participants to respond more politely than they normally would if confronted with a more mature tool with the same functions. I also found indications in several studies that participants with a computer engineering background are more polite than those without one.

The Nature of Likert Scales

Most user acceptance and user experience studies in my review rely on Likert scales. Likert scales are a very useful instrument for measuring agreement or disagreement with statements. One of their nice features is that they project subjective opinions onto an interval scale, so the responses can be analysed with statistical methods. The most common variant I found in this review is the five-point Likert scale between the extremes "strongly disagree" and "strongly agree". Interestingly, empirical methodology suggests that seven-point Likert scales produce the best results for measuring between these extremes.

The extremes are typically associated with the values 1 (worst) and 5 (best). What is often forgotten in studies is that participants perceive the center of a Likert scale as "indifferent" or "neutral" if no alternative is provided. So in most cases a Likert scale really measures two intervals: the first from "strongly disagree" (or: I hate your tool) to indifference (I don't care), and the second from indifference to "strongly agree" (I love your tool). Therefore, balanced Likert scales (those with an odd number of points) should be modelled with values centred on 0 to indicate indifference. For five-point Likert scales this means an offset of 3. If we apply that offset to the entire scale we end up with a scale from -2 (worst) to 2 (best), where the center is indifference. For the statistics this offset is meaningless, but it makes the results much easier to read: values in the center interval (that is, -1 < x < 1) are a good indicator that the participants consider the presented tool more or less insignificant. This is particularly the case if most of the answers in an evaluation fall into this interval.
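This rebiasing takes only a few lines of Python; the response values below are made up purely for illustration:

```python
# Hypothetical responses to one item on a five-point Likert scale (values invented).
responses = [4, 3, 3, 5, 2, 4, 3, 3, 4, 3]

# Subtract the offset of 3 so that 0 marks indifference: -2 (worst) to 2 (best).
rebias = [r - 3 for r in responses]

mean = sum(rebias) / len(rebias)
print(rebias)  # [1, 0, 0, 2, -1, 1, 0, 0, 1, 0]
print(mean)    # 0.4 -> inside the indifference interval (-1, 1)
```

The mean of 0.4 would look respectable as 3.4 on the raw scale, but on the rebiased scale it is plainly inside the indifference interval.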

Another drawback is how the results of Likert scales are reported. Although statistical rules for interval scales can be nicely applied, papers frequently report incomplete results. Most of the papers in my current review report only mean values. However, a mean value without the standard deviation does not say much about the distribution of the data. Furthermore, the median is in most cases a better indicator for the accumulated opinion of the participants, because it points to actual values on the scale and not to some artificial value that was not selectable by the participants. The median is also more robust than the mean, so outliers have less impact on the result. Given the lack of data, one should read results that present only mean values more conservatively.
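A small sketch with invented response data illustrates why the mean alone is not enough:

```python
import statistics

# Three hypothetical items on a 1-5 Likert scale (data made up for illustration).
item_a = [3, 3, 3, 3, 3, 3, 3, 3]  # everyone indifferent
item_b = [1, 5, 1, 5, 1, 5, 1, 5]  # strongly polarised
item_c = [4, 4, 4, 4, 4, 4, 1, 1]  # mostly positive, two outliers

for name, data in (("item_a", item_a), ("item_b", item_b), ("item_c", item_c)):
    print(name,
          "mean:", round(statistics.mean(data), 2),
          "sd:", round(statistics.stdev(data), 2),
          "median:", statistics.median(data))

# item_a and item_b share the mean of 3, but the standard deviation (0 vs. ~2.14)
# reveals that the second group is polarised rather than indifferent.
# For item_c the two outliers pull the mean down to 3.25 while the median stays at 4.
```

A paper reporting only "mean = 3" could describe either item_a or item_b, two very different participant populations.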

Putting Things Together

When interpreting the findings of user acceptance studies without an underpinning empirical model, it is very important to focus on the context of the study. Typically, papers give indications that allow us to estimate the impact of the novelty effect and the politeness of the participants. The impact of these effects on the overall results will not be extremely strong, but it is a good guideline for whether the results should be interpreted more conservatively or not.

This is particularly relevant for interpreting the results of Likert scales. While many authors tend to present their results in a positive light, those results are often not that positive. For most items in the related studies we find reported mean values around 3 (on five-point Likert scales). For user acceptance these are by no means good results, although they show that the study was not entirely on the wrong track regarding these items. Particularly for short-running studies, the novelty effect and the politeness of the participants should be factored in for values around the center of the scale. So a conservative reading of mean values between 3 and 4 (or 0 and 1 on a rebiased scale) suggests that the participants did not consider the tool relevant with respect to the item. This adds up: if all items report means around the center of the scale, we can interpret this as "not really relevant" or, as mobile technology becomes more common, as "I've seen/demand better".

It is important to understand that these findings are not insignificant for research. Particularly when exploring new functions, such results can provide good indications of the effect a new technology will have on the learning process, or whether this effect might get masked by other aspects that influence learning. However, interpreting the results overly positively without an empirical model may lure one into expecting stronger effects or into misinterpreting identified effects as directly related to the technology. Unfortunately, this exercise is left to the reader, as many authors of research papers avoid this discussion and highlight the "good" results in their findings.