What I use for data visualization

[A version of this post appears on the O’Reilly Data blog.]

Depending on the nature of the problem, data size, and deliverable, I still draw upon an array of tools for data visualization. As I survey the Design track at next month’s Strata conference, I see creators and power users of visualization tools that many data scientists have come to rely on. Several pioneers will lead sessions on (new) tools for creating static and interactive charts, against small and massive data sets.

The Grammar of Graphics
To this day, R (specifically ggplot2) is the tool I turn to for producing static visualizations. Even the simplest charts allow me to quickly spot data problems and anomalies, and a tool like ggplot2 can accomplish a lot in very few lines of code. Charts produced by ggplot2 look much nicer than base R plots, and once you get past the initial learning curve, they are easy to fine-tune and customize.

Hadley Wickham (1), the creator of ggplot2, is speaking on two new domain-specific languages (ggvis and dplyr) that make it easy for R users to declaratively create interactive web graphics. As Hadley describes it, ggvis is an interactive Grammar of Graphics for R. As more data scientists turn to interactive visualizations that can be shared through web browsers, ggvis is the natural next tool for ggplot2 users.
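To make "declarative" a bit more concrete, here is a minimal sketch in Python of the idea behind a grammar of graphics: you state the data, which columns map to x and y, and which geometry to draw, and the library works out the scales and marks. This is not ggplot2's or ggvis's API. The `Chart` class, its field names, and the sample rows are all made up for illustration, and it only handles numeric columns.

```python
from dataclasses import dataclass


@dataclass
class Chart:
    """Hypothetical declarative chart spec: data + aesthetic mapping + geometry."""
    data: list            # rows, each a dict of column name -> number
    x: str                # column mapped to horizontal position
    y: str                # column mapped to vertical position
    geom: str = "point"   # "point" (scatter) or "line"
    width: int = 300
    height: int = 200
    margin: int = 20

    def _scale(self, values, lo, hi):
        """Linearly map values onto the pixel range [lo, hi]."""
        vmin, vmax = min(values), max(values)
        span = (vmax - vmin) or 1  # avoid dividing by zero for a constant column
        return [lo + (v - vmin) / span * (hi - lo) for v in values]

    def to_svg(self):
        """Render the spec as an SVG string."""
        xs = self._scale([row[self.x] for row in self.data],
                         self.margin, self.width - self.margin)
        # SVG's y axis points down, so the pixel range is flipped.
        ys = self._scale([row[self.y] for row in self.data],
                         self.height - self.margin, self.margin)
        if self.geom == "point":
            marks = "".join(
                f'<circle cx="{px:.1f}" cy="{py:.1f}" r="3"/>'
                for px, py in zip(xs, ys)
            )
        elif self.geom == "line":
            pts = " ".join(f"{px:.1f},{py:.1f}" for px, py in zip(xs, ys))
            marks = f'<polyline points="{pts}" fill="none" stroke="black"/>'
        else:
            raise ValueError(f"unknown geom: {self.geom!r}")
        return (
            f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{self.width}" height="{self.height}">{marks}</svg>'
        )


# Made-up sample data: the same spec renders as points or as a line
# by changing one word, which is the appeal of the declarative style.
rows = [{"day": 1, "visits": 10}, {"day": 2, "visits": 30}, {"day": 3, "visits": 20}]
scatter_svg = Chart(rows, x="day", y="visits").to_svg()
line_svg = Chart(rows, x="day", y="visits", geom="line").to_svg()
```

Switching from a scatter plot to a line chart changes the description of the chart, not the drawing code. Real grammar-of-graphics tools add layers, statistical transformations, facets, and guides on top of this basic mapping.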

Leland Wilkinson, the primary author of The Grammar of Graphics (2), will also be at Strata to lead a tutorial on an interesting expert system that makes machine-learning techniques accessible to business users. Leland’s work has influenced many other visualization tools, including Polaris (from the Stanford team that founded Tableau), Bokeh, and ggbio (for genomics data). Effective visualization techniques will be an important component of his Strata tutorial.

d3 and JavaScript
For interactive web visualizations, I previously turned to Google Charts and protovis. But I (and many other protovis fans) began migrating over to d3.js when it was announced in 2011. Since then I’ve used its versatility and power to create standard static charts and highly customized, interactive visualizations. If you’re new to d3, Scott Murray is leading an introductory tutorial at Strata that I highly recommend (Scott is a popular instructor and author).

Both protovis and d3 originated in Jeff Heer’s lab at Stanford (IDL is now at the University of Washington). Jeff (3) and fellow Trifacta co-founder Joe Hellerstein are leading a Strata tutorial on data wrangling. They will go over new tools and techniques for assessing and transforming dirty data (a major bottleneck for most data scientists). Attendees get a chance to learn from two of the best practitioners/teachers in the data space.

As I noted in a recent post, I’m currently using Python and Scala as my general-purpose programming languages. I’m pleased to see the PyData community embrace visualization tools from other languages. Brian Granger, one of the leaders of the IPython community, is giving a talk on how IPython’s architecture allows users to leverage tools like d3 from within IPython notebooks.

Processing and Superconductor
Before I started using protovis, I used Processing to create sketches and simple interactive visualizations. I’ve moved away from Processing in recent years, but a quick glance at the available libraries makes me want to try it out again. Ben Fry, the co-creator of Processing, is giving a talk and keynote presentation at Strata.

If you need to create interactive web visualizations on large data sets, check out Superconductor, a new open source project that originated from UC Berkeley’s Par Lab. It leverages high-level, simple, domain-specific languages that automatically find and exploit parallelism (4). One of its demos renders 100,000 time-series data points, spread across hundreds of line graphs, with JavaScript controls for real-time zooming and panning.

Leo Meyerovich (5), the creator of Superconductor, will give an overview of this promising new tool at Strata.

(1) Hadley is holding an office hour at Strata on Thursday, February 13th.
(2) ggplot2 was inspired by The Grammar of Graphics proposed by Wilkinson, Anand, and Grossman.
(3) Jeff is holding an office hour at Strata on Wednesday, February 12th.
(4) From Superconductor’s GitHub repo: “Superconductor automates the use of web workers (multicore), WebCL (GPU), and WebGL (GPU).”
(5) Leo is holding an office hour at Strata on Wednesday, February 12th.
