Why I now use “four-threshold” flags on dashboards (book excerpt)

tl;dr: For a dashboard to gain traction among users, it must visually flag metrics so that users’ attention is quickly drawn to the ones that require it. Unfortunately, the most common methods for flagging metrics on dashboards (“Vs. previous period,” “Single-threshold flags,” “% deviation from target flags,” and “Good/Satisfactory/Poor ranges”) fail to serve this purpose: they often flag metrics that don’t require attention, fail to flag metrics that do, and can be slow to visually scan. In this post, I discuss the “four-threshold” method that I now use, which doesn’t have these fundamental limitations.

This post is an excerpt from my upcoming book, Beyond Dashboards, and is the seventh in an eight-part series of posts on how to determine which metrics to visually flag on a dashboard.

Until a year or two ago, most of the dashboards that I designed used one of three common methods that I’ve discussed in previous posts for determining which metrics to visually flag in order to draw users’ attention to metrics that require it. After realizing that all of these methods had serious drawbacks and limitations that reduced the overall effectiveness of dashboards that implemented them, though, I started using the “four-threshold” method instead, which produces visual flags that look like this:

avg calls per hour - small.png

In case you haven’t already guessed, red flags indicate problems that need to be dealt with, green flags indicate metrics that are doing unexpectedly well, and the intensity (darkness) of the flag’s color indicates the severity or the “extraordinariness” of a metric’s current value. The absence of a flag indicates that a metric doesn’t require any action at the moment. The color of each flag is determined based on four threshold values that need to be set for each metric:

Image 2.png

Each of the four thresholds has a very specific, precise definition, which is one reason why the four-threshold method works well:

  • Crisis: The point at which improving the metric would become the user’s top and possibly only priority (represented as a flag with the darkest red color).

  • Actionably Bad: The point at which the metric would be just bad enough that the user would actually do something about it, such as sending an email to a subordinate or reallocating budget (lightest red).

  • Actionably Good: The point at which the metric would be just good enough that the user would actually do something about it, such as calling a subordinate to congratulate them or adjusting expectations upward (lightest green).

  • Extraordinary: The point at which the metric would exceed the user’s most optimistic expectations and also, therefore, a point that the metric will probably never reach (darkest green).
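
As a rough illustration of how four thresholds translate into a flag, here is a minimal Python sketch for a “higher is better” metric. The names (Thresholds, flag_zone), the zone labels, and the sample threshold values for an “avg calls per hour” metric are all assumptions made for this example and are not taken from any particular dashboard product. The sketch only picks the color family; shading within a zone is ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Four thresholds for a "higher is better" metric (hypothetical names)."""
    crisis: float
    actionably_bad: float
    actionably_good: float
    extraordinary: float

    def __post_init__(self):
        ordered = [self.crisis, self.actionably_bad,
                   self.actionably_good, self.extraordinary]
        if ordered != sorted(ordered):
            raise ValueError("thresholds must increase from crisis to extraordinary")


def flag_zone(value: float, t: Thresholds) -> str:
    """Return which kind of flag, if any, the value would receive."""
    if value <= t.crisis:
        return "darkest red"
    if value <= t.actionably_bad:
        return "red"
    if value < t.actionably_good:
        return "no flag"
    if value < t.extraordinary:
        return "green"
    return "darkest green"


# Illustrative thresholds only; real values would come from the dashboard's users.
calls_per_hour = Thresholds(crisis=20, actionably_bad=30,
                            actionably_good=45, extraordinary=60)
for value in (15, 25, 38, 50, 65):
    print(value, "->", flag_zone(value, calls_per_hour))
```

Note that the widest zone in this example is the “no flag” one, which is the intended outcome: most values on most days should produce no flag at all.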

The four-threshold method avoids many of the problems with conventional flagging methods: 

  • Crises don’t get buried. On a day when multiple metrics on a dashboard are flagged, users can immediately see whether there are any real crises among them or only minor issues to deal with. On a bad day when many metrics are flagged, users can quickly see where to focus first (i.e., the darkest flags) instead of facing a wall of identical-looking flags that they need to click on, one by one, in order to prioritize. Similarly, metrics that are performing stunningly well (i.e., dark green) are visually distinct from those that are performing well but not extraordinarily so (i.e., light green), which further helps users to quickly prioritize their efforts.

  • No false alarms. Unlike the “Vs. previous period” and “% deviation from target” flagging methods, the four-threshold method only flags metrics if they genuinely require the user to actually do something about them. If, on a given day, there are no metrics that require the user to take action, the dashboard will have no flags on it, enabling the user to see at a glance that no action is required of them at the moment and that they can move on to other work.

  • Lower risk of “Christmas tree syndrome.” Dashboards that use conventional flagging methods are prone to the dreaded “Christmas tree syndrome,” wherein many metrics end up getting flagged even on normal days. The definitions of the four thresholds, though, make users less prone to asking themselves questions like “Exactly where do I want this metric to be?” and more prone to asking questions like “At what point would I actually do something about this metric?” Questions of the second kind tend to yield wider “no flag” ranges, which is both more realistic and more useful, and lowers the risk of Christmas tree syndrome.

  • “Higher is worse” metrics blend in seamlessly. With some conventional flagging methods, metrics for which higher values are worse (e.g., Expenses) don’t blend in seamlessly with metrics for which higher values are better (e.g., Revenue). In these cases, the user must mentally switch gears every time they encounter a “higher is worse” metric on a dashboard in order to interpret the flags correctly, making the dashboard slower and more cumbersome to review. For a “higher is worse” metric, however, the four thresholds could be set like this, enabling its flags to be visually scanned seamlessly alongside “higher is better” metrics:

Image 3.png

  • “Goldilocks” metrics blend in seamlessly. For some metrics, such as “Headcount vs. Plan,” we want the metric to stay within a desirable “Goldilocks” range, and deviations either above OR below that range are undesirable. Some conventional flagging methods, however, are unable to flag “Goldilocks” metrics effectively. The four-threshold values for a “Goldilocks” metric could be set like this, though, enabling its flags to be visually scanned seamlessly alongside regular, non-Goldilocks metrics:

Image 4.png

  • The flags are “information-dense.” Despite being more informative than conventional dashboard flags, four-threshold flags are the same size or smaller. On dashboards that contain a lot of information and where screen real estate is tight (i.e., on most dashboards), this allows many metrics to be shown without resorting to filters, scrolling, tabs, or other information-hiding mechanisms. This also allows metrics that require attention for more than one reason (e.g., a metric that’s both well above our internal expectations AND well below external analysts’ expectations) to be flagged by showing more than one flag per metric, which I’ll discuss in more detail in another post.

  • It’s easier for users to set thresholds with precise definitions than thresholds with ambiguous ones. The four thresholds each have a very specific, precise definition, making it easier to set values for them than for thresholds with ambiguous names such as “Poor” or “Good” (see my last post for more on that). It’s easier for users to answer a question such as “At what point would you actually do something about this metric?” than a question such as “At what point does this metric cease to be ‘satisfactory’ and become ‘good’?”

  • It’s easier for users to set thresholds for gradual transitions than sudden jumps. With Good/Satisfactory/Poor ranges, for example, if the “Poor” range for Profit Margin ends at 12%, a value of 12% would be “Poor” but 12.001% would be “Satisfactory.” This is, of course, not the way that users think about metrics, since the transition from undesirable to desirable ranges is almost always gradual, not a sudden jump. When setting the four thresholds, however, users aren’t forced to express transitions between ranges as sudden jumps, making it easier to set values for each threshold. For example, if the “Crisis” threshold for Profit Margin were set to 12% and “Actionably Bad” to 14%, a Profit Margin of 12.001% would be flagged as “almost a crisis” on the dashboard (an “almost darkest red” flag), not “satisfactory” or “no action required.” This more realistic model makes it easier for users to set threshold values.
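
The gradual-transition idea, along with the “higher is worse” and “Goldilocks” cases in the list above, can be sketched in a few lines of Python. This is only one possible reading of the method. The linear ramp between thresholds, the minimum shade of 0.25 for the lightest flag, the signed-intensity convention (negative for red, positive for green, 0.0 for no flag), the two-sided red layout for Goldilocks metrics, and all function names and sample numbers other than the 12% and 14% Profit Margin thresholds are assumptions made for illustration.

```python
def _shade(distance, span, min_shade):
    """Linear ramp from min_shade (lightest flag) to 1.0 (darkest flag)."""
    if span <= 0:
        return 1.0  # thresholds coincide: jump straight to the darkest shade
    fraction = min(max(distance / span, 0.0), 1.0)
    return min_shade + (1.0 - min_shade) * fraction


def flag_intensity(value, crisis, actionably_bad, actionably_good,
                   extraordinary, min_shade=0.25):
    """Signed intensity: negative = red, positive = green, 0.0 = no flag.

    For a "higher is worse" metric, pass the thresholds in descending
    order (crisis is the largest number).
    """
    # Negate everything for "higher is worse" metrics so that the rest of
    # the function can assume crisis < actionably_bad < ... < extraordinary.
    if crisis > extraordinary:
        value, crisis, actionably_bad, actionably_good, extraordinary = (
            -value, -crisis, -actionably_bad, -actionably_good, -extraordinary)

    if value <= actionably_bad:
        return -_shade(actionably_bad - value, actionably_bad - crisis, min_shade)
    if value >= actionably_good:
        return _shade(value - actionably_good,
                      extraordinary - actionably_good, min_shade)
    return 0.0


def goldilocks_intensity(value, crisis_low, bad_low, bad_high, crisis_high,
                         min_shade=0.25):
    """Assumed layout for a Goldilocks metric: red flags on both sides."""
    if value <= bad_low:
        return -_shade(bad_low - value, bad_low - crisis_low, min_shade)
    if value >= bad_high:
        return -_shade(value - bad_high, crisis_high - bad_high, min_shade)
    return 0.0


# Profit Margin (%): Crisis at 12 and Actionably Bad at 14, as in the text;
# the two "good" thresholds (18 and 22) are made up for this example.
print(flag_intensity(12.001, 12, 14, 18, 22))  # very close to -1 (darkest red)
print(flag_intensity(13.9, 12, 14, 18, 22))    # a light red
print(flag_intensity(16, 12, 14, 18, 22))      # 0.0, no flag

# Expenses ($K), "higher is worse": thresholds are passed in descending order.
print(flag_intensity(115, 120, 110, 95, 85))

# Headcount vs. Plan (% deviation), Goldilocks: too low and too high are both bad.
print(goldilocks_intensity(10, -15, -5, 5, 15))
```

Because every metric type reduces to the same signed intensity, the rendering layer does not need to know whether a metric is “higher is better,” “higher is worse,” or “Goldilocks.”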

Taken together, these advantages enable dashboard users to quickly and reliably see what requires their attention. Conventional flagging methods don’t fulfill this basic purpose since they frequently trigger false alarms, fail to flag metrics that require attention, fail to distinguish between minor and major issues, and can be slow and cumbersome to visually scan.

A potential concern with four-threshold flags might be that they’re not very precise, i.e., that a colored dot can’t tell users exactly how well or how badly a metric is doing, or what the exact thresholds are for that metric. There are two important things to keep in mind regarding such a concern, though. Firstly, four-threshold flags are designed for use on problem-scanning displays, where the goal is to allow users to visually scan a large number of metrics and quickly spot those that require attention. When scanning for problems, users only need to be able to quickly distinguish between minor issues and major ones, and more precision would just slow them down. Secondly, four-threshold flags should ideally be paired with tooltips, popups or, even better, “metric diagnosis” displays (which I’ll discuss in a future post). These would appear when a metric is clicked or tapped and provide more detailed information about the metric’s precise threshold values, if required.

Another potential concern might be that thresholds such as “Actionably Good” and “Crisis” are too subjective, and that “hard numbers” such as “% deviation from target” or “% change vs. previous period” are better because they’re “more objective.” As I discussed in previous posts, however, those “hard numbers” frequently trigger false alarms and fail to flag metrics that genuinely require attention. While very precise, they’re also very inaccurate when it comes to indicating the amount of attention that a metric is likely to require. Ultimately, human judgment is needed to answer questions such as “At what point should someone actually do something about this metric?” although that judgment is, ideally, informed by useful statistics, which I’ll discuss in my next post.

Others have suggested replacing the colored dots with numerical “severity scores” or “quality scores,” where Crisis would be indicated by a score of, say, 0-40, and Extraordinary by, say, 90-100. Showing numerical scores would introduce several problems, however. The first is that human beings process text much more slowly than graphical objects, so the dashboard would take much longer to review and, even worse, the metrics that require the most attention would no longer visually “pop out.” The second problem is that a score such as “62/100” isn’t inherently meaningful; the user needs to mentally translate it into something like “no action required” or “major problem” in order to determine if or how they should respond to it. Showing users a “no action required” or “major problem” flag eliminates this unnecessary mental translation and doesn’t force users to memorize arbitrary numerical ranges such as “0 to 40 is a crisis, 41 to 50 is a major problem, etc.” One final problem with numerical scores is that they introduce the risk of users coming to nonsensical conclusions such as, “A metric with a score of 70 is doing twice as well as one with a score of 35” or “Revenue can’t exceed $12M since it would then have a score of over 100.”

The preceding two paragraphs feature examples of what I call the “quantification bias,” where analytically minded people are sometimes uncomfortable with qualitative or subjective categories and attempt to alleviate that discomfort by converting them into numerical scores (e.g., “Crisis” becomes 0, “No action required” becomes 50, etc.). While numbers might make these values seem more objective, this is illusory, since they’re still based entirely on people’s judgments of what to consider “good” or “bad” for a given metric (there’s another future blog post right there).

As with any display that encodes values using a green/red palette, dashboards that use the four-threshold method should offer a colorblind-safe alternative palette since red and green can be hard to distinguish for approximately 5% of the general population. Blue/orange palettes work well for most colorblind users:

Image 5.png
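
One way to support an alternative palette is to keep the flag logic palette-independent and choose the colors only at render time. The Python sketch below assumes that each flag has already been reduced to a signed intensity between -1 (darkest “bad” color) and 1 (darkest “good” color), with 0 meaning no flag. That convention, the function name flag_color, the assignment of orange to “bad” and blue to “good,” and the specific RGB values are illustrative assumptions, not a vetted colorblind-safe specification.

```python
# (lightest, darkest) RGB pairs for each direction; colors are illustrative only.
PALETTES = {
    "red_green": {
        "bad": ((250, 190, 190), (160, 0, 0)),
        "good": ((190, 235, 190), (0, 110, 0)),
    },
    "orange_blue": {
        "bad": ((253, 208, 162), (217, 72, 1)),
        "good": ((198, 219, 239), (8, 69, 148)),
    },
}


def flag_color(intensity, palette="red_green"):
    """Map a signed intensity in [-1, 1] to a hex color, or None for no flag."""
    if intensity == 0:
        return None
    light, dark = PALETTES[palette]["bad" if intensity < 0 else "good"]
    t = min(abs(intensity), 1.0)  # 0 = lightest shade, 1 = darkest shade
    # Blend each RGB channel linearly between the light and dark colors.
    channels = [round(l + (d - l) * t) for l, d in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


# The same flag rendered in the default and the alternative palette.
print(flag_color(-1.0), flag_color(-1.0, palette="orange_blue"))
```

Since only the final lookup changes, switching palettes could be offered as a per-user setting without touching the thresholds themselves.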

There’s more to talk about when it comes to four-threshold flags, but I’ll reserve those topics for future posts since this one is already a little longer than I’d like it to be. In the next and final post in this series, I’ll introduce some simple statistics that can be used to set default values for the four thresholds automatically so that users don’t have to set them for every metric manually.

Many thanks to Daniel Zvinca, who generously put these ideas through the wringer, resulting in some important refinements to the final draft.
