Managing Risk with Usability Testing
Originally published: Sep 01, 2001
By Aviva Rosenstein, Ph.D.,
Usability Services Manager
At a trade show I attended earlier this month, a young CEO asked me an excellent question about the principles behind discount usability testing. He told me that he recently got a report on a usability test conducted by members of his product development team, but was inclined to discount the results because they weren’t statistically significant. He asked, “Why should I pay attention to usability test results based on the observations of only a few users? Shouldn’t the tests be based on a random sample of the whole population of users, and be statistically valid?”
I told him, “Usability tests conducted with a small number of users aren’t intended to give you statistically significant results. But that doesn’t mean the test results are useless! Performing a study with a small number of users can be a cost-effective way of minimizing the potential risks of developing and launching unusable software or web interfaces, but only if the participants for the tests are carefully selected, the study is appropriately designed and conducted, and the data is analyzed methodically and accurately.”
I want to share the explanation I gave him with all of you because understanding how usability tests help you manage risk can save you a lot of money and time.
Saying that the findings of a study are “statistically valid” doesn’t mean that the results of the test are more useful than data resulting from other empirical research approaches. It simply means that the kinds of observations made from the group participating in the study are likely to be found within the larger population, within a given range of certainty. While claims of statistical validity may be important for scientific investigators, they are not always critical or even that relevant for product developers.
This claim may sound surprising; so let me clarify it a little further. I’ll begin by defining some terms often used in conventional research.
A measure is valid when it actually measures what it is intended to measure. Designing a valid measure is more complicated than it might seem. In usability tests, for example, testers frequently measure how long it takes each user to complete a given task. But that measure may not always be an accurate reflection of the usability of the interface.
Here’s an illustration. In the past, I’ve seen reports that used web server log data to measure how long, on average, users took to add something to an online shopping cart and proceed through a checkout process. While it might seem that “time-to-completion of task” is a useful measure of usability, it doesn’t account for other possible explanations for the users’ behavior. Some users may be distracted by cross-selling opportunities and leave the shopping cart task to browse for other items. Others may be pulled away by something in their environment, such as a phone call or other interruption. Still others may have opened other browser windows to research prices with different vendors before completing their purchases. All of these possibilities could lead to longer “time-to-completion of task” measures, but that doesn’t make the shopping cart interface any less usable! So averaging the time-to-completion data contained in the log file may be statistically valid, but it is completely useless in this case. Since other possible explanations for the results besides poor usability were not excluded or accounted for, the time-to-completion measure doesn’t actually measure the overall usability of the shopping cart interface.
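To make the problem concrete, here is a minimal sketch with made-up numbers. The durations are hypothetical, and so is the split into “focused” and “interrupted” sessions: a real server log would not label sessions this way, which is exactly why the average is hard to interpret.

```python
from statistics import mean, median

# Hypothetical checkout durations in seconds, as they might be derived from
# server log timestamps (first cart request to order confirmation).
focused_sessions = [95, 110, 102, 88, 120, 105, 98]

# Hypothetical sessions where the user left the task for a while
# (cross-sell browsing, a phone call, price comparison in another window).
interrupted_sessions = [640, 910, 1500]

# The log only ever shows the combined list; it cannot tell the two apart.
all_sessions = focused_sessions + interrupted_sessions


def summarize(label, durations):
    """Print and return the (mean, median) of a list of durations."""
    avg = mean(durations)
    mid = median(durations)
    print(f"{label}: mean={avg:.1f}s median={mid:.1f}s n={len(durations)}")
    return avg, mid


log_mean, log_median = summarize("all logged sessions", all_sessions)
focused_mean, focused_median = summarize("focused sessions only", focused_sessions)
```

Three interrupted sessions out of ten are enough to push the logged average to more than three times the average of the focused sessions, even though the interface is identical in both cases. Switching to the median softens the distortion but does not solve the underlying problem, since the log still cannot say why any given session took as long as it did.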
A measure is reliable when it produces consistent results each time it is taken. However, reliability does not guarantee validity. It’s entirely possible to design a study that is reliable but still does not result in valid data. In the shopping cart example above, the time-to-completion average remained consistent in the server log data from week to week. This makes the measure reliable, but it does not increase its validity, since it is not an accurate measure of the shopping cart’s usability.
Researchers use sampling because studying everyone in a population is usually far too impractical and expensive. Theoretically, sampling allows the researcher to generalize findings from the subset to the larger population. For example, sampling a subset of potential users out of the total user population and observing how they navigate a particular interface allows one to generalize the behavior of the sample to the larger population with a specified degree of certainty. Most people are familiar with random sampling, in which every person in the population has an equal chance of being selected for the sample. However, there are several other types of sampling methods available to the researcher, each with its own level of precision and difficulty.
The sampling method chosen for any usability study should balance the costs of the method against the desired level of precision needed for the study. A “convenience sample” of co-workers recruited from down the hall might be inexpensive and easy, but it may not provide the most appropriate level of validity for a particular project. For example, if these testers are already familiar with the application under development, the test may result in findings that are not representative of the actual user population. Hence, the validity of a usability study based on a convenience sample may be questionable. We recommend different sampling techniques depending on the characteristics of the user population of interest and the risk levels associated with a specific project. Typically, we use methods that ensure that test participants reflect the relevant characteristics of the system’s intended user population.
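One way to keep participants in line with the relevant characteristics of the intended users is to recruit a fixed number from each user group. The sketch below is only an illustration of that idea (often called stratified or quota sampling), not a description of any particular recruiting process; the candidate pool, the `experience` attribute, and the `stratified_sample` helper are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, key, per_group, seed=None):
    """Pick `per_group` candidates at random from each group defined by `key`."""
    rng = random.Random(seed)  # a seed makes the selection repeatable

    # Bucket the candidates by the characteristic of interest.
    groups = defaultdict(list)
    for person in candidates:
        groups[person[key]].append(person)

    selected = []
    for name in sorted(groups):
        members = groups[name]
        if len(members) < per_group:
            raise ValueError(f"not enough candidates in group {name!r}")
        # Simple random sample (without replacement) inside each group.
        selected.extend(rng.sample(members, per_group))
    return selected


# Hypothetical recruiting pool: 10 experts and 20 novices.
pool = [
    {"id": i, "experience": "expert" if i % 3 == 0 else "novice"}
    for i in range(30)
]

participants = stratified_sample(pool, key="experience", per_group=3, seed=1)
for person in participants:
    print(person)
```

A plain random draw of six people from this pool could easily return five or six novices. Sampling within each group guarantees that both kinds of user are observed, at the cost of needing to know each candidate’s group before recruiting.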
Generally speaking, the larger the size of the sample, the more reliable the study will be. But remember that reliability does not guarantee validity! The costs of increasing the sample size (and the reliability of the study) should be balanced against the increase’s potential return on investment. Consider how much that extra reliability is worth to you. In scientific studies, the acceptable confidence level for making a claim is usually set very high, at 95% or 99%. Put into straightforward terms, this means that studies are designed so that “if we repeated this study with many different samples from the same population, our estimate would capture the true population value at least 95% of the time.”
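To get a feel for what those confidence levels cost, the sketch below uses the standard textbook formula for estimating a proportion (for example, a task success rate) from a simple random sample: n = z² · p(1 − p) / e². The function name and the chosen margins of error are assumptions made for illustration, and the formula itself assumes a normal approximation and a large population.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence, margin, proportion=0.5):
    """Participants needed to estimate a proportion within +/- `margin`.

    Uses n = z^2 * p * (1 - p) / margin^2. The default p = 0.5 is the
    worst case, giving the largest required sample.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * proportion * (1 - proportion) / margin ** 2)


for confidence in (0.95, 0.99):
    for margin in (0.10, 0.05):
        n = required_sample_size(confidence, margin)
        print(f"{confidence:.0%} confidence, +/-{margin:.0%} margin: {n} participants")
```

Under these assumptions, pinning down a success rate to within five percentage points at 95% confidence takes 385 participants, and 664 at 99%. Halving the margin of error roughly quadruples the sample, which is why statistically rigorous usability studies become expensive quickly.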
It’s important to manage risk in the business world, but it’s also important to minimize costs. Designing usability tests that deliver statistically valid results is not difficult, but the large sample sizes required can make these tests prohibitively expensive, and the added benefit is often marginal. Since we recommend that you conduct usability tests early in the design process to identify areas that need improvement, it’s usually much cheaper to just fix the problems identified by a small group of testers than it would be to statistically confirm the results of the tests with a larger population of users.
Furthermore, research has demonstrated that the number of usability problems found in a design levels off significantly after the first six testers. The first five test participants typically discover approximately 85% of the usability problems in a task, but it might take another ten testers to find the remaining 15%. Consequently, we find that the smaller sample size is a more cost-effective choice for most development projects, even if it does not engender statistical validity.
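The leveling-off effect can be sketched with a simple problem-discovery model commonly attributed to Nielsen and Landauer: the share of problems found by n testers is 1 − (1 − L)ⁿ, where L is the chance that a single tester runs into any given problem. The article does not state a formula, so treat both the model and the often-quoted average of L = 0.31 as assumptions; real projects can have quite different detection rates.

```python
def proportion_found(testers, detection_rate=0.31):
    """Expected share of usability problems found by `testers` participants.

    Assumes each tester independently finds any given problem with
    probability `detection_rate` (0.31 is a commonly cited average,
    used here purely as an illustrative default).
    """
    return 1 - (1 - detection_rate) ** testers


previous = 0.0
for n in range(1, 16):
    found = proportion_found(n)
    # `gain` is what the n-th tester adds on top of the first n - 1.
    gain = found - previous
    print(f"{n:2d} testers: {found:6.1%} found (+{gain:.1%})")
    previous = found
```

With these assumed values, five testers uncover about 84% of the problems, while testers six through fifteen together add only the remaining 15% or so. Each additional participant costs roughly the same to recruit and observe but contributes less and less new information.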
Plus, you can use the money you saved by not running that enormous study to run additional test cycles later in the development process. This not only allows you to validate the design choices made to fix the initial problems, but also lets you catch any additional usability problems introduced by the new designs.