Transforming organizations through analytics centers of excellence

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Carme Artigas on helping enterprises transform themselves with big data tools and technologies.

In this episode of the Data Show, I spoke with Carme Artigas, co-founder and CEO of Synergic Partners (a Telefónica company). As more companies adopt big data technologies and techniques, it’s worth remembering that the end goal is to extract information and insight; as with any collection of tools and technologies, the main challenge is identifying and prioritizing the right use cases.

As Artigas describes, one can categorize use cases for big data into the following types:

  • Improve decision-making or operational efficiency
  • Generate new or additional revenue
  • Predict or prevent fraud (more broadly, forecast and minimize risk)

Artigas has spent many years helping large organizations develop best practices for using data and analytics. We discussed some of the key challenges facing organizations that want to adopt big data technologies, as well as analytics centers of excellence and AI in the enterprise.

The state of machine learning in Apache Spark

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark.

In this episode of the Data Show, we look back at a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.

We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley, and Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered on machine learning. Like many in the audience, I was first attracted to Spark because it let me scale machine learning algorithms to large data sets while maintaining reasonable latency.

Here is a partial list of the items we discussed:

  • The current state of machine learning in Spark.
  • Given that a lot of innovation has taken place outside the Spark community (e.g., scikit-learn, TensorFlow, XGBoost), we discussed the role of Spark ML moving forward.
  • The plan to make it easier to integrate advanced analytics libraries that aren’t “textbook machine learning” (e.g., NLP, time series analysis, and graph analysis) into Spark and Spark ML pipelines; see the sketch after this list.
  • Some upcoming projects from Berkeley and Stanford that target AI applications (including newer systems that provide lower latency and higher throughput).
  • Recent Berkeley and Stanford projects that address two key bottlenecks in machine learning: the lack of training data, and deploying and monitoring models in production.
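To make the pipeline discussion concrete, here is a minimal sketch of a Spark ML pipeline in PySpark. The data, column names, and model choice are illustrative, not from the episode; the point is that each stage, including a simple NLP-flavored step like tokenization, plugs into the same Pipeline abstraction.

```python
# A minimal Spark ML pipeline sketch (PySpark). All data and column
# names here are hypothetical placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Hypothetical training data: a text column and a binary label.
train = spark.createDataFrame(
    [("spark is fast", 1.0), ("slow batch job", 0.0)],
    ["text", "label"],
)

# Each stage is a Transformer or Estimator; Pipeline chains them so
# the whole workflow is fit and applied as a single unit.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(train)
model.transform(train).select("text", "prediction").show()
```

Because every stage implements the same Transformer/Estimator interface, an outside library that exposes its functionality this way can slot into a pipeline alongside Spark’s built-in stages, which is what makes the integration work discussed above possible.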

[Full disclosure: I am an advisor to Databricks.]

Related resources: