MLbase: Scalable Machine-learning made accessible

[Cross-posted on the O’Reilly Strata blog.]

When applying machine learning to large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms, a process that may involve configuring a hodgepodge of tools, each requiring different input files, programming languages, and interfaces. Some software tools do not scale to big data, so data scientists first sample and test ideas on smaller subsets before tackling the problem of implementing a distributed version of the final algorithm.

To increase productivity, data scientists should ideally be able to test ideas quickly without much coding, context switching, tuning, and configuration. A research project (0) out of UC Berkeley’s AMPLab and Brown University seems to do just that: MLbase aims to make cutting-edge, scalable machine-learning algorithms available to non-experts. MLbase will have four pieces: a declarative language (MQL, discussed below), a library of distributed algorithms (ML-Library), an optimizer (ML-Optimizer), and a runtime (ML-Runtime).


2012 Revenue of some Big Data companies

The chart below is based on Wikibon’s estimates (1) of the 2012 revenue of some Big Data companies. Using d3, I drew a chart that shows 2012 revenue in millions of dollars, as well as the share of revenue derived from services, for a few select/startup companies:


  • The Big 3 Hadoop Vendors (Cloudera/MapR/Hortonworks): Combined revenue was $102M, with $61.6M coming from services. In particular, Hortonworks relies exclusively (per Wikibon’s estimates) on revenue from services. In comparison, services accounted for 53% and 49% of revenue at Cloudera and MapR, respectively.
  • Business Intelligence (QlikTech/Tableau/Jaspersoft/Pentaho/Datameer/SiSense): Combined revenue was $531.2M, dwarfing the Hadoop vendors (2). With market leaders QlikTech and Tableau unable to scale to massive data sets, startups like Datameer and Platfora are generating interest from the many companies already invested in Hadoop and HDFS.
  • Analytics (Splunk/Palantir/Revolution Analytics/Digital Reasoning): Combined revenue was $287M, with $94.2M coming from services.
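
For readers who want the services share as a percentage, here is a minimal Python sketch that derives it from the combined figures quoted in the bullets above. Only the two segments for which a services figure is given are included; the names `SEGMENTS`, `services_share`, and `summarize` are my own and not part of Wikibon's report.

```python
# Combined 2012 figures quoted in the bullets above (Wikibon estimates, $ millions).
# Only segments with a stated services figure are listed.
SEGMENTS = {
    "Big 3 Hadoop vendors": {"total": 102.0, "services": 61.6},
    "Analytics": {"total": 287.0, "services": 94.2},
}


def services_share(total, services):
    """Return services revenue as a percentage of total revenue."""
    return 100.0 * services / total


def summarize(segments):
    """Map each segment name to its services share, rounded to one decimal."""
    return {
        name: round(services_share(f["total"], f["services"]), 1)
        for name, f in segments.items()
    }


if __name__ == "__main__":
    for name, share in summarize(SEGMENTS).items():
        print(f"{name}: {share}% of revenue from services")
```

By this calculation, roughly 60% of the Hadoop vendors' combined revenue came from services, versus about a third for the analytics group.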

  (1) Methodology: “Regarding methodology, the Big Data market size, forecast, and related market-share data was determined based on extensive research of public revenue figures, media reports, interviews with vendors, venture capitalists and resellers regarding customer pipelines, product roadmaps, and feedback from the Wikibon community of IT practitioners. Many vendors were not able or willing to provide exact figures regarding their Big Data revenue, and because many of the vendors are privately held it was necessary for Wikibon to triangulate many types of information to determine our final figures. We also held extensive discussions with former employees of Big Data companies to further calibrate our models.”
  (2) Tableau alone generated more revenue than the Big 3 Hadoop vendors combined.