Alibaba ♥ Spark: Next time someone asks you if Apache Spark scales, point them to this recent post by Chinese e-commerce juggernaut Alibaba. What particularly caught my eye is the company’s heavy usage of GraphX, Spark’s library for graph analytics.
[Full disclosure: I’m an advisor to Databricks, a startup commercializing Apache Spark.]
Narrative Recommendations: When NarrativeScience started out, I thought of it primarily as a platform for generating short, factual stories for (hyperlocal) news services (a newer startup OnlyBoth seems to be focused on this, their working example being the use of “box scores” to cover “college” teams). More recently NarrativeScience has aimed its technology at the lucrative Business Intelligence market. Starting from structured data, NarrativeScience extracts and ranks facts, and weaves a narrative arc that analysts consume. The company retains the traditional elements of BI tools (tables, charts, dashboards) and supplements it with narrative summaries and recommendations. I like the concept of adding narrative outputs, and as with all relatively new technologies, the algorithms and accompanying user interfaces are bound to get better over time. The technology is largely “language” agnostic, but to reap maximum benefit it does need to be tuned for the specific domain you want to use it in.
With spreadsheets, you have to calculate. With visualizations, you have to interpret. With narratives, all you have to do is read.
Beyond building training sets for machine-learning, crowdsourcing is being used to enhance the results of machine-learning models: in active learning, humans take care of uncertain cases, models handle the routine ones. Active Learning is one of those topics that many data scientists have heard of, few have tried, and a small handful know how to do well. As data problems increase in complexity, I think active learning is a topic that many more data scientists need to familiarize themselves with.
Machine learning research is often not applied to real world situations. Often the improvements are small and the increased complexity is high, so except in special situations, industry doesn’t take advantage of advances in the academic literature.
Active learning is an example where research proposes a simple strategy that makes a huge difference and almost everyone applying machine learning to real world use cases is doing it or should be doing it. Active learning is the practice of taking cases where the model has low confidence, getting them labeled, and then using those labels as input data.
Webcast attendees will learn simple, practical ways to improve their models by cleaning up and tweaking the distribution of their training data. They will also learn about best practices from real world cases where active learning and data selection took models that were completely unusable in production to extremely effective.
Filtergraph and the power of visual exploration: A web-based tool for exploring high-dimensional data sets, Filtergraph came out of the lab of Astrophysicist Keivan Stassun. It has helped researchers make several interesting discoveries including a paper (that appeared in Nature) on a technique that improves estimates for the sizes of hundreds of exoplanets. For this particular discovery, Keivan tasked one of his students to play around with Filtergraph until she discovered “interesting patterns”. Her visual exploration led to an image that inspired the discoveries contained in the Nature paper.
RunMyCode: I was glad to see several sessions on the important topic of reproducibility of research projects and results (I’ve written about this topic from the data science perspective here and here). Beyond just sharing data sets, RunMyCode lets researchers share the data and computer programs they used to generate the results contained in their papers. Sharing both data and code used in research papers are important steps. (For complex setups, a tool like Vagrant can come in handy.) But to address the file drawer problem, access to data/code for “negative results” is also needed.
A network framework of cultural history: Scifoo alum Maximilian Schich pointed me to some of his group’s recent work on cultural migration in the Western world. I’ve seen Maximillian give preliminary talks on these results in the past (at Scifoo). He combines meticulous data collection, stunning visualizations, and network science to discover and quantify cultural patterns.
Long before the advent of “big data”, analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, data inspection, and data modeling convenient. Among R users this meant proficiency with Data Frames – objects used to store data matrices that can hold both numeric and categorical data. A data.frame is the data structure consumed by most R analytic libraries.
But not all data scientists use R, and nor is R suitable for all data problems. I’ve been watching with interest the growing number of alternative data structures for business analysis and advanced analytics. These new tools are designed to handle much larger data sets and are frequently optimized for specific problems. And they all use idioms that are familiar to data scientists – either SQL-like expressions, or syntax similar to those used for R data.frame or pandas.DataFrame.
This webcast will introduce scikit-learn, an Open Source project for Machine Learning in Python and review some new features from the recent 0.15 release such as faster randomized ensemble of decision trees and optimization for the memory usage when working on multiple cores.
We will also review on-going work part of the 2014 edition of the Google Summer of Code: neural networks, extreme learning machines, improvements for linear models, and approximate nearest neighbor search with locality-sensitive hashing.