[A version of this post appears on the O’Reilly Strata blog.]
After recently playing with SAS Visual Analytics, I’ve been thinking about tools for visual analysis. By visual analysis I mean the type of analysis most recently popularized by Tableau, QlikView, and Spotfire: you encounter a data set for the first time and conduct exploratory data analysis, with the goal of discovering interesting patterns and associations. Having used a few visualization tools myself, here’s a quick wish-list of features (culled from tools I’ve used or have seen in action).
Requires little (to no) coding
The viz tools I currently use require programming skills. Coding means switching back and forth between a visual (chart) and text (code). It’s nice (1) to be able to customize charts via code, but when you’re in the exploratory phase not having to think about code syntax is ideal. Plus GUI-based tools allow you to collaborate with many more users.
Includes an expanded set of basic charts
Aside from statistical graphics (line, bar, scatter, histogram, bubble, boxplot,…), these days the ability to visualize hierarchies (treemap), financial data (stock charts), longitudinal data, geospatial data (maps), and network data is essential.
Charts are easy to customize
It should be easy to tweak labels, colors, and other elements. There are times when default labels need to be resized or repositioned, to make them legible. You should also be able to adjust coloring schemes to your liking (colors are usually assigned based on category or, in the case of heat maps, value).
Templates can be created
Once you create a chart with your preferred color and labeling scheme, you should be able to templatize it for future projects. [Ideally templates support rule-based formatting (“if negative, color = red”), but this starts to involve some coding.]
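To make the rule-based idea concrete, here is a minimal Python sketch of a reusable formatting template. It is only an illustration: the names RULES, color_for, and apply_template are hypothetical and do not come from any of the tools mentioned in this post.

```python
# A formatting "template" as an ordered list of (predicate, color) rules.
# All names here are hypothetical; no real viz tool's API is being shown.
DEFAULT_COLOR = "gray"

RULES = [
    (lambda v: v < 0, "red"),     # "if negative, color = red"
    (lambda v: v >= 0, "black"),  # everything else that is comparable
]


def color_for(value, rules=RULES, default=DEFAULT_COLOR):
    """Return the color of the first rule whose predicate matches."""
    for predicate, color in rules:
        if predicate(value):
            return color
    return default  # e.g. NaN matches neither rule above


def apply_template(values, rules=RULES):
    """Assign a color to every data point using the same rule set."""
    return [color_for(v, rules) for v in values]


if __name__ == "__main__":
    print(apply_template([-2, 0, 3.5]))
```

Because the rules live in one list, the same template can be handed to any future chart, which is the point of templatizing in the first place.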
Visual summaries are easy to generate (histograms, association matrix)
You’ll be exploring data sets that contain many observations (rows) and variables (columns). SAS Visual Analytics produces a quick summary (average, min/max, histogram) for each variable and displays the results in a compact, scrollable format. This is done entirely through a GUI and doesn’t require any coding.
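For readers who want to see what such a per-variable summary involves, here is a rough Python sketch using only the standard library. It is not how SAS Visual Analytics works internally; the function names, the equal-width binning, and the sample data are all assumptions made for illustration.

```python
from statistics import mean


def summarize(column, bins=5):
    """Average, min/max, and equal-width histogram counts for one variable."""
    lo, hi = min(column), max(column)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for x in column:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return {"mean": mean(column), "min": lo, "max": hi, "histogram": counts}


def summarize_table(table, bins=5):
    """Summarize every numeric column of a {name: values} table."""
    return {name: summarize(values, bins) for name, values in table.items()}


if __name__ == "__main__":
    # Made-up data, purely for illustration.
    data = {"unit_sales": [1, 2, 3, 4, 5], "returns": [0, 0, 0, 0, 0]}
    for name, stats in summarize_table(data, bins=2).items():
        print(name, stats)
```

A GUI tool does the same arithmetic for every column at once and draws the histogram instead of printing counts.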
Drill-down to source points: identify, isolate, and fix minor data errors
Visual summaries (2) alert you to potential problems with your data (outliers or errors). A few tools give you the ability to isolate outliers or fix simple data problems through a GUI. More generally, it’s nice to be able to drill down from the chart to examine (via dynamic rollover or other method) the underlying data.
While exploring data, you need to be able to quickly filter by value or category – using checkboxes, drop-downs, sliders, …
Support for visual pivoting
Many business analysts are heavy users of pivot tables – a tabular summarization technique found in spreadsheets and reporting tools. Visual pivoting replaces tabular presentation with charts. My first experience using this type of visual exploration was through the Trellis graphs introduced in S/S-Plus. Thanks to Tableau’s easy-to-use interface, this form of visual analysis has become a popular way to explore data.
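As a rough illustration of the difference between a tabular pivot and a visual one, the Python sketch below first builds the pivot table and then renders one small text bar chart per row category on a shared scale, in the spirit of a trellis display. The function names and the sales records are made up for this example; real tools draw proper charts rather than rows of # characters.

```python
from collections import defaultdict


def pivot(records, row_key, col_key, value_key):
    """Sum value_key for every (row_key, col_key) combination."""
    cells = defaultdict(lambda: defaultdict(float))
    for record in records:
        cells[record[row_key]][record[col_key]] += record[value_key]
    return {row: dict(cols) for row, cols in cells.items()}


def text_panels(table, width=20):
    """Render one text bar chart per row category, all on the same scale."""
    # Shared scale across panels; fall back to 1 if every cell is zero.
    peak = max(v for cols in table.values() for v in cols.values()) or 1
    lines = []
    for row in sorted(table):
        lines.append(f"[{row}]")
        for col in sorted(table[row]):
            value = table[row][col]
            bar = "#" * round(width * value / peak)
            lines.append(f"  {col:<4}{bar} {value:g}")
    return "\n".join(lines)


# Made-up records, purely for illustration.
SALES = [
    {"region": "East", "quarter": "Q1", "units": 10},
    {"region": "East", "quarter": "Q1", "units": 5},
    {"region": "East", "quarter": "Q2", "units": 30},
    {"region": "West", "quarter": "Q1", "units": 20},
    {"region": "West", "quarter": "Q2", "units": 15},
]

if __name__ == "__main__":
    print(text_panels(pivot(SALES, "region", "quarter", "units")))
```

Keeping every panel on one scale is what makes the panels comparable at a glance, which is the main advantage over reading numbers out of a table.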
Support for analytics
Many visualization tools lack analytic capabilities. From simple (error bar, quantiles) to advanced (clustering, forecasting, multidimensional scaling (3)), analytic tools expand what users can do. Case in point, SAS Visual Analytics has tools for conducting sensitivity analysis and forecasting (GUI-based, no coding required). An example is to take a given time-series (unit sales), plot a forecast of its behavior for the next six time periods, and study how the forecast varies when other key variables (customer satisfaction) change.
Tools for sharing, collaboration, and replication
Several tools let you publish (4) your static or interactive charts, and some tools even let you subscribe (5) to the work of other users. For sharing, collaboration, and documentation, it should be possible to annotate your work. Being able to collaborate with others would be nice; at a minimum, one should be able to copy (and modify) the work of another user.
Big Data: Volume and Variety (6)
A tool should produce charts quickly even when it’s hitting massive data sets. Simply put, it should be truly interactive (7). Several new tools target larger data sets, and some are geared specifically for Hadoop users (a partial list includes Datameer, Platfora, SiSense, and SAS Visual Analytics). But there will be occasions when you’ll be working with small data sets (or be offline). To that end you should be able to visually explore small data (locally using your laptop) without having to connect to a more powerful environment (such as a cluster or a beefy server).
I haven’t come across great viz tools for exploring unstructured data, so I’ll interpret variety in a different way. Co-existence (usually of Hadoop & data warehouses) means data will continue to reside in different systems. Being able to connect to a variety of data sources is essential. (Among startups, Datameer does a good job of this.) Some tools include public data sets (e.g., US Census) and use them to generate examples.
Recommend items worth investigating (8)
When you first encounter a data set with lots of variables, it can be a bit overwhelming. Using simple pattern recognition techniques, tools should surface associations/patterns/anomalies worth investigating. Some tools in finance do this for time-series: trends, new highs/lows, and forecasts are drawn automatically. I’d love to have suggestions for what visual pivots (trellis charts) to draw.
(0) Thanks to Lynn Cherny for reviewing an early draft of this post and for suggesting a few features.
(1) Unless of course you have killer programming tools, a la Bret Victor. You can do some of the things described in the post using ScaleR from Revolution Analytics – but it’s a tool that requires coding in R.
(2) A good example: SAS Visual Analytics displays the number of distinct values of categorical variables. If the number of distinct values is unusually large, you likely have a data quality issue.
(3) Or other tools for handling high-dimensional data sets. Still waiting for a next-gen ggobi!
(4) Datameer takes this a step further: it has an app market.
(5) Some tools even send you realtime alerts when data for charts you’ve subscribed to have changed.
(6) I omitted Velocity – the ability to handle streaming data. I consider that a nice-to-have, not a must-have, feature for a visual exploration tool. Having said that, I do think the ability to handle realtime updates is essential when you share your work with others. See (5).
(7) When working with truly massive data sets it’s natural to have some latency. Rather than having users idle while waiting, visual analysis tools should support multiple tabs or workspaces. Most database query tools have this feature: you can work on other queries while a query is still running.
(8) A recent conversation with Sara Alspaugh inspired this feature.