Crowdsourcing Feature Discovery

More than algorithms, companies gain access to models that incorporate ideas generated by teams of data scientists

[A version of this post appears on the O’Reilly Data blog and Forbes.]

Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons he started CrowdFlower was that, as a data scientist, he got frustrated with having to create training sets for many of the problems he faced. More recently, companies have been experimenting with active learning (humans take care of uncertain cases, models handle the routine ones). Along those lines, Adam Marcus described in detail how Locu uses crowdsourcing services to perform structured extraction (converting semi-structured or unstructured data into structured data).
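The division of labor mentioned above (models handle routine cases, humans take the uncertain ones) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `route_by_confidence`, the 0.8 threshold, and the toy model are all hypothetical and do not describe how CrowdFlower or Locu actually work.

```python
from typing import Callable, List, Tuple


def route_by_confidence(
    items: List[str],
    predict: Callable[[str], Tuple[str, float]],
    threshold: float = 0.8,  # assumed cutoff; a real system would tune this
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split items into auto-labeled pairs and a queue for human review."""
    auto_labeled, needs_human = [], []
    for item in items:
        label, confidence = predict(item)
        if confidence >= threshold:
            auto_labeled.append((item, label))  # routine case: trust the model
        else:
            needs_human.append(item)  # uncertain case: send to the crowd
    return auto_labeled, needs_human


def toy_model(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a trained classifier."""
    if "menu" in text:
        return ("menu", 0.95)
    return ("other", 0.55)


auto, human = route_by_confidence(["lunch menu", "open late?"], toy_model)
print(auto)   # items the model labeled on its own
print(human)  # items queued for human labeling
```

Labels collected from the human queue can then be added to the training set, so the model gradually handles more cases on its own.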

Another area where crowdsourcing is popping up is feature engineering and feature discovery. Experienced data scientists will attest that generating features is as important as (if not more important than) the choice of algorithm. Startup CrowdAnalytix uses public/open data sets to help companies enhance their analytic models. The company has access to several thousand data scientists spread across 50 countries and counts a major social network among its customers. Its current focus is on providing “enterprise risk quantification services to Fortune 1000 companies”.

CrowdAnalytix breaks up projects into two phases: feature engineering and modeling. During the feature engineering phase, data scientists are presented with a problem (dependent variable(s)) and are asked to propose features (predictors) along with brief explanations of why they might prove useful. A panel of judges evaluates features based on the accompanying evidence and explanations. Typically 100+ teams enter this phase of the project, and 30+ teams propose reasonable features.

Instrumenting collaboration tools used in data projects

Built-in audit trails can be useful for reproducing and debugging complex data analysis projects

[A version of this post appears on the O’Reilly Data blog.]

As I noted in a previous post, model building is just one component of the analytic lifecycle. Many analytic projects result in models that get deployed in production environments. Moreover, companies are beginning to treat analytics as mission-critical software and have real-time dashboards to track model performance.

Once a model is deemed to be underperforming or misbehaving, diagnostic tools are needed to help determine appropriate fixes. It could well be that models need to be revisited and updated, but there are instances when the underlying data sources (1) and data pipelines are what need to be fixed. Beyond the formal systems put in place specifically for monitoring analytic products, tools for reproducing data science workflows could come in handy.

Version control systems are useful, but they appeal primarily to developers. The recent wave of data products comes with collaboration features that target a broader user base. Properly instrumented, collaboration tools are also useful for reproducing and debugging complex data analysis projects. As an example, Alpine Data records all the actions made while working on a data project: a screen displays all recent “actions and changes”, and team members can choose to leave comments or questions.

If you’re a tool builder charged with baking in collaboration, consider how best to expose activity logs as well. Properly crafted “audit trails” can be very useful for uncovering and fixing problems that arise once a model gets deployed in production.

Alpine Chorus: audit trail
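As a rough illustration of the kind of activity log discussed in this post, here is a minimal Python sketch of an append-only audit trail. All names (`AuditEvent`, `AuditTrail`, `record`, `history`) and the sample users and actions are hypothetical; this is not Alpine Data's implementation, only one plausible shape for such a log.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditEvent:
    """One recorded action; frozen so past entries cannot be edited."""
    user: str
    action: str
    target: str  # e.g. a data set, pipeline step, or model
    timestamp: datetime
    comment: Optional[str] = None


class AuditTrail:
    """Append-only log of actions taken during a data project."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def record(self, user: str, action: str, target: str,
               comment: Optional[str] = None) -> AuditEvent:
        event = AuditEvent(user, action, target,
                           datetime.now(timezone.utc), comment)
        self._events.append(event)
        return event

    def history(self, target: str) -> List[AuditEvent]:
        """Return every recorded action on one target, oldest first."""
        return [e for e in self._events if e.target == target]


trail = AuditTrail()
trail.record("ana", "changed_filter", "sales_pipeline")
trail.record("ben", "retrained_model", "churn_model", comment="added Q4 data")
for event in trail.history("churn_model"):
    print(event.user, event.action, event.comment)
```

When a deployed model starts misbehaving, filtering the log by the affected model or pipeline gives a starting point for retracing what changed and who changed it.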

(1) Models can be on the receiving end of bad data or the victim of attacks from adversaries.

2013 Revenue of some startup companies

The chart below is based on Wikibon’s estimates (1) of the 2013 revenue (2) of some Big Data companies. Using d3, I drew a chart that shows 2013 revenue (in millions) from Big Data products and services, as well as the share of revenue derived from services, for a selection of companies, including startups:

Wikibon: 2013 (Big Data) revenue of some startups

  • The Big 3 Hadoop Vendors (Cloudera/MapR/Hortonworks): Combined revenue was $163M, up from $102M in 2012; $57.2M came from services, down from $61.6M in 2012.
  • Business Intelligence (including Qlik, Tableau, GoodData, Jaspersoft, Pentaho, Datameer, SiSense, and Actuate): Wikibon parses out what share of revenue comes from Big Data products and services. The total 2013 revenues of Tableau ($204M) and Qlik ($467M) still dwarf those of the Hadoop vendors. However, the portions that Wikibon attributes to Big Data products and services were much smaller ($33M and $28M, respectively).
  • Analytics (Splunk/Palantir/Revolution Analytics/Alteryx): Combined revenue was $761M, with $299.8M coming from services.
  • [Revenue estimates from earlier years can be found here.]
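The services share of revenue implied by the figures in the list above can be worked out directly. The short Python sketch below uses only the numbers quoted in this post (Wikibon's estimates, in millions of dollars); the dictionary keys and the helper name `services_share` are just illustrative labels.

```python
# Figures quoted in the list above, in millions of US dollars:
# (total Big Data revenue, revenue from services)
revenue = {
    "Hadoop vendors, 2012": (102.0, 61.6),
    "Hadoop vendors, 2013": (163.0, 57.2),
    "Analytics, 2013": (761.0, 299.8),
}


def services_share(total: float, services: float) -> float:
    """Services revenue as a percentage of total revenue."""
    return 100.0 * services / total


shares = {
    name: round(services_share(total, services), 1)
    for name, (total, services) in revenue.items()
}

for name, share in shares.items():
    print(f"{name}: {share}% of revenue from services")
```

By these estimates, services fell from roughly 60% of the Hadoop vendors' combined revenue in 2012 to roughly 35% in 2013, while the Analytics group derived a little under 40% of its 2013 revenue from services.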

    (1) Methodology: “Regarding methodology, the Big Data market size, forecast, and related market-share data was determined based on extensive research of public revenue figures, media reports, interviews with vendors, venture capitalists and resellers regarding customer pipelines, product roadmaps, and feedback from the Wikibon community of IT practitioners. Many vendors were not able or willing to provide exact figures regarding their Big Data revenue, and because many of the vendors are privately held, Wikibon had to triangulate many types of information to determine its final figures. We also held extensive discussions with former employees of Big Data companies to further calibrate our models. Information types used to estimate revenue of private Big Data vendors included supply-side data collection, number of employees, number of customers, size of average customer engagement, amount of venture capital raised, and age of vendor.”

    (2) Having spoken to some of the companies mentioned in this post, I think that while the above revenue estimates aren’t 100% accurate, they’re in the “general ballpark”.

    Interface Languages and Feature Discovery

    It’s easier to “discover” features with tools that have broad coverage of the data science workflow

    [A version of this post appears on the O’Reilly Data blog and Forbes.]

    Here are a few more observations based on conversations I had during the just-concluded Strata Santa Clara conference.

    Interface languages: Python, R, SQL (and Scala)
    This is a great time to be a data scientist or data engineer who relies on Python or R. For starters, there are developer tools that simplify setup and package installation, and that provide user interfaces designed to boost productivity (RStudio, Continuum, Enthought, Sense).

    Increasingly, Python and R users can write the same code and run it against many different execution engines. Over time the interface languages will remain constant, but the execution engines will evolve or even get replaced. Specifically, there are now many tools that target Python and R users interested in implementations of algorithms that scale to large data sets (e.g., GraphLab, Adatao, H2O, Skytree, Revolution R). Interfaces for popular engines like Hadoop and Apache Spark are also available: PySpark users can access algorithms in MLlib, and SparkR users can use existing R packages.

    In addition, many of these new frameworks go out of their way to ease the transition for Python and R users. “… bindings follow the Scikit-Learn conventions”, and as I noted in a recent post, with SFrames and Notebooks GraphLab, Inc. built components that are easy for Python users to learn.
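To make the "same interface, different engine" idea concrete, here is a small sketch of the fit/predict convention popularized by scikit-learn. It deliberately uses no library: both classes (`MajorityClassifier`, `NearestNeighborClassifier`) and the helper `train_and_predict` are hypothetical toy stand-ins, not the API of any framework named above. The point is only that calling code written against the convention does not change when the implementation behind it does.

```python
from collections import Counter
from typing import List


class MajorityClassifier:
    """Toy 'engine' 1: always predicts the most common training label."""

    def fit(self, X: List[float], y: List[str]) -> "MajorityClassifier":
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self mirrors the scikit-learn convention

    def predict(self, X: List[float]) -> List[str]:
        return [self.majority_ for _ in X]


class NearestNeighborClassifier:
    """Toy 'engine' 2: predicts the label of the closest training point."""

    def fit(self, X: List[float], y: List[str]) -> "NearestNeighborClassifier":
        self.points_ = list(zip(X, y))
        return self

    def predict(self, X: List[float]) -> List[str]:
        return [min(self.points_, key=lambda p: abs(p[0] - x))[1] for x in X]


def train_and_predict(model, X_train, y_train, X_new):
    """Calling code depends only on fit/predict, not on the engine."""
    return model.fit(X_train, y_train).predict(X_new)


X_train = [1.0, 2.0, 8.0, 9.0, 10.0]
y_train = ["low", "low", "high", "high", "high"]

for model in (MajorityClassifier(), NearestNeighborClassifier()):
    print(type(model).__name__, train_and_predict(model, X_train, y_train, [1.5, 8.5]))
```

Swapping one model for the other requires no change to `train_and_predict`, which is the property that lets an interface stay constant while the execution engine behind it evolves.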
