As Tuesday’s post suggested, my thoughts on usability measurement have now crystallized. To provide a meaningful and consistent comparison of usability across demand generation vendors, you could:
1. Define a set of business scenarios that must be supported by the system. Each scenario would describe a type of marketing campaign and the system tasks required to run it. These tasks would cover system set-up, materials creation, campaign design, execution and evaluation. Some tasks would be common to several scenarios, while others would be unique. For example, pretty much any kind of campaign would involve creating an email, but only some campaigns require adding custom fields to the system database.
The result would be a grid with tasks listed down the side, scenarios across the top, and checkmarks showing which tasks are used in which scenarios. A small sample is below. Note that you can build a single task list for any combination of scenarios by simply combining the checkmarks in their columns. Thus, in the sample table, scenarios 1 and 2 require tasks 1, 2 and 3. In many cases, there will be two entries for each task, one for setting it up and another for repeating it.
|scenario 1||scenario 2||scenario 3||...|
2. Develop a specific package for each scenario, with both a task list and standard materials such as a list of users to set up, data elements to capture, email contents to produce, campaign logic, etc. You also need a standard Salesforce.com installation and perhaps company Web site to integrate during testing. Assembling these packages would be quite a bit of work, but only has to be done once. The project could start with relatively simple packages and expand them over time.
3. Have an expert user (typically a vendor employee) run through the required tasks while the tester tracks their time and results. As I noted in the earlier post, this means a single expert user is simulating different users at a real client. (These are marketing managers, operations specialists, system administrators, database managers, etc.) This makes sense if we assume that the client users will all be experts in their own areas. But it also means that the tester must assess which type of user would perform each task. The test packages would include score sheets to make capturing this information as easy as possible.
4. Check the test results by executing the scenario campaigns and identifying any errors. You need this step to ensure the work was actually completed correctly—otherwise, the experts could simply zoom through their tasks without worrying about accuracy. Part of the process would be for the tester to “respond” to the promotions to ensure that the system reacts correctly. This is another labor-intensive process. Results will be summarized in an error report that is part of the final evaluation.
5. Have users select their scenarios they wish to evaluate. Then generate reports for the tasks in those scenarios, showing the (a) tasks completed (b) time required (c) workload on different users and (d) error rates. Comparing the results for different systems will give a good sense of strengths and weaknesses.
* * *
Of course, the process won’t end with the detailed reports. People will want to combine the results in a single score that can be used to rank the vendors. **sigh**. Building such a score requires adjusting for factors including:
- differences in system features (some systems lack some features and, thus, can’t execute all the specified tasks)
- differences in the importance of different tasks
- differences in the value of difference users’ time
- the impact of handing off tasks among users (this adds time and errors that won’t be captured in a single-user test)
- differences in error rates and in the importance of different errors
- differences in the mix of tasks and their importance at different companies
I’m sure there are other factors as well. A simple approach might just be to assign scores for each separate dimension, say on a 1-10 scale, and then add them for a combined score. You could evolve a more elaborate approach over time, but the resulting figures will never have any specific meaning. Still, they should provide a reasonably valid ranking of the competitors.
The report could also supplement or replace the single figure with a graph that plots the results on two or more dimensions. For example, a classic scatter plot could position the each system based on breadth (number of tasks completed) vs. productivity (error-free tasks completed per hour). This would more clearly illustrate the trade offs between the products.
The good news in all this is that the closer you get to a specific company’s requirements, the more you can replace any generic assumptions with that company’s own data. This means that any aggregate score becomes much more meaningful for that company.
Let me clarify and expand that last point, because it’s very important. The tests just need to be done once because the results (tasks completed, work time, user workload, error rate) don’t change based on the individual client. So you could store those results in a database and then apply client-specific parameters such as scenarios chosen, task mix, and user costs to get a client-specific ranking without actually conducting any client-specific tests.
Of course, no one would purchase a system based solely on someone else’s tests. But having the data available could greatly speed the evaluation process and improve buyers’ understanding of the real differences between systems. In addition, the test scenarios themselves should be help buyers to decide what they want to see demonstrated.
(Before I lose track of it, let me point out that this approach doesn’t address the ease-of-learning component of usability. That requires a larger base of testers, or at least an assessment of what looks difficult to learn. It’s possible that assessing ease-of-learning really involves the same judgment as assessing which type of user will perform each task. Both, after all, are based on how hard the task looks. In any case, this issue needs more thought.)
What Do You Think?
Well, this all sounds just great to me, but I’m not an objective observer. I’m really curious to learn what you think (on the assumption that “you”, the readers of this blog, include many demand generation vendors and users).
Users: Would you use something like this during the selection process? Would you be able to prioritize your requirements and estimate the numbers of different tasks (campaigns, emails, landing pages, etc.) per year? What would you pay for a customized report based on your inputs? Would you prefer to do this sort of testing for yourself? If so, would you pay for the task lists and scenario packages to help you run your own tests? Would you want consulting help with conducting those tests? In general, do you actually conduct a detailed vendor comparison before making a choice?
Vendors: Does this approach seem fair? Is it very different from the standard scenarios you’ve already worked up for training and sales demonstrations? Would the standard task lists and scenarios make it easier to gather prospects' business requirements? Would the test results accelerate your sales cycles and deployment times? Would testing provide a useful benchmark for your development efforts? Would you participate in the testing, knowing the results were going to be published in a database? Would you pay to have the tests done? Would you help to fund initial development?
Everybody: Does this make sense? What flaws or risks do you see? What’s the best way to build the scenarios, task lists and packages? In particular, could be they be built cooperatively (“crowd sourced”) with a Wiki or something similar?
Please comment and ask others to comment as well.