[A version of this post appears on the O’Reilly Strata blog.]
Data analysts have long lamented the amount of time they spend on data wrangling. Rightfully so, as some estimates suggest they spend a majority of their time on it. The problem is compounded by the fact that these days, data scientists are encouraged to cast their nets wide, and investigate alternative (unstructured) data sources. The general perception is that data wrangling is the province of programmers and data scientists. Spend time around Excel users and you’ll learn that they do quite a bit of data wrangling too!
In my work I tend to write scripts and small programs to do data wrangling. That usually means some combination1 of SQL, Python, and Spark2. I’ve played with Google Refine (now called OpenRefine) in the past, but I found the UI hard to get used to. Part of the problem may have been that I didn’t use the tool often3 enough to become comfortable.
For most users data wrangling still tends to mean a series of steps that usually involves different tools (e.g., you often need to draw charts to spot outliers and anomalies). As I’ve pointed out in previous posts, workflows that involve many different tools require a lot of context-switching, which in turn affects productivity and impedes reproducability.
“We are washing our data at the side of the river on stones. We are really in the early, early ages of productivity technology in data science.”
Joe Hellerstein (Strata-NYC 2012), co-founder and CEO of Trifacta
Startups to the rescue
If data scientists spend a majority of time on data wrangling, the right set of tools can significantly boost their productivity (and thus their companies bottom lines). In addition, by empowering business users to do many routine data wrangling tasks, they are able to “play” with many more data sources without overburdening their IT departments.
Visual Analysis startups Platfora, Datameer, and ClearStory all offer some tools for data wrangling. ClearStory offers a catalog of external data sets that users can easily combine (“harmonize”) with their own data. SiSense has a set of data transformation functions, along with tools for combining data sets (“join recommendations”).
While Visual Analysis tools are recognizing the need to simplify data wrangling, there are many data sets that are beyond the scope of their simple data preparation4 utilities. Luckily some startups are focused on data preparation and data wrangling. With Paxata’s data preparation tools, analysts are able to quickly focus on analytics (instead of data wrangling). In particular users of Paxata5 are able to do many of the data preparation tasks they previously asked their IT colleagues to do. Dataiku6 offers a product called Data Science Studio that comes with statistical analysis tools for data cleansing. Trifacta is a SF-based startup that is building tools to simplify data transformation and preparation. The founders of Trifacta are the academic team behind the highly regarded DataWrangler project.
- (Upcoming tutorial at Strata Santa Clara 2014) Data Transformation: Skills of the Agile Data Wrangler
- Data Analysis: Just one component of the Data Science workflow
- Data Jujitsu: The Art of Turning Data into Product
(1) I rarely use stats tools like R for data wrangling, although years ago I remember sometimes using SAS (proc sql and the data step).
(2) Data processing and wrangling is one of the common uses of Spark.
(3) Your need a certain number or “reps” (repetitions) to begin appreciating a tool like OpenRefine.
(4) Preparation prior to data analysis.
(5) In the case of Paxata this includes filter, transform, cluster, and merge.
(6) Update (11/24/2013): Hat tip to Sam Bessalah.