Building a natural language processing library for Apache Spark

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: David Talby on a new NLP library for Spark, and why model development starts after a model gets deployed to production.

When I first discovered and started using Apache Spark, most of the use cases I tackled involved unstructured text. The absence of libraries meant rolling my own NLP utilities and, in many cases, implementing a machine learning library (this was pre-deep learning, and MLlib was much smaller). I'd always wondered why no one had bothered to create an NLP library for Spark when so many people were using Spark to process large amounts of text. The early success of BigDL confirms that users like having the option of native libraries.

In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago, I mentioned to Talby the need for an NLP library within Spark; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception BigDL has received and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will become a standard tool among Spark users.

Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.

Here are some highlights from our conversation:

The state of NLP in Spark

Here are your two choices today. Either you want to leverage all of the performance and optimization that Spark gives you, which means you want to stay basically within the JVM and use a Java-based library. In that case, your options include OpenNLP, which is open source, or Stanford CoreNLP, which requires licensing in order to be used in a commercial product. These are older and more academically oriented libraries, so they have limitations in performance and in what they can do.
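To make the tradeoff concrete, here's a minimal sketch of what a Spark-native NLP pipeline looks like with the newly released Spark NLP library. Treat it as illustrative rather than definitive: the annotator names (DocumentAssembler, Tokenizer) follow the library's Python API, and it assumes the Spark NLP package is already on your cluster's classpath.

```python
# A minimal sketch of a Spark-native NLP pipeline using Spark NLP.
# Illustrative only: assumes the Spark NLP package is on the classpath.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

spark = SparkSession.builder.appName("spark-nlp-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Spark NLP runs natively inside Spark ML pipelines.",)], ["text"])

# Wraps raw text into the annotated-document type the annotators consume
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Splits each document into token annotations
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Each annotator is an ordinary Spark ML stage, so they compose in a Pipeline
pipeline = Pipeline(stages=[document_assembler, tokenizer])
pipeline.fit(df).transform(df).select("token.result").show(truncate=False)
```

The point of staying within the JVM is visible here: every annotator is a standard Spark ML pipeline stage, so it can be mixed freely with MLlib feature transformers and estimators, and the whole pipeline runs on the same executors as the rest of the Spark job.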
Continue reading “Building a natural language processing library for Apache Spark”

Machine intelligence for content distribution, logistics, smarter cities, and more

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Rhea Liu on technology trends in China.

In this episode of the Data Show, I spoke with Rhea Liu, analyst at China Tech Insights, a new research firm that is part of Tencent’s Online Media Group. If there’s one place where AI and machine learning are discussed even more than the San Francisco Bay Area, that would be China. Each time I go to China, there are new applications that weren’t widely available just the year before. This year, it was impossible to miss bike sharing, mobile payments seemed to be accepted everywhere, and people kept pointing out nascent applications of computer vision (facial recognition) to identity management and retail (unmanned stores).

I wanted to consult local market researchers to help make sense of some of the things I’ve been observing from afar. Liu and her colleagues have put out a series of interesting reports highlighting some of these important trends. They also have an annual report—Trends & Predictions for China’s Tech Industry in 2018—that Liu will discuss in her keynote and talk at Strata Data Singapore in December.

Here are some highlights from our conversation:
Continue reading “Machine intelligence for content distribution, logistics, smarter cities, and more”

How companies can navigate the age of machine learning

[A version of this post appears on the O’Reilly Radar.]

To become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.

Over the last few years, the data community has focused on collecting data, building infrastructure for that purpose, and using data to improve decision-making. We are now seeing a surge in interest in advanced analytics and machine learning across many industry verticals.

In this post, I share slides and notes from a talk I gave this past September at Strata Data NYC offering suggestions to companies interested in adding machine learning capabilities. The information stems from conversations with practitioners, researchers, and entrepreneurs at the forefront of applying machine learning across many different problem domains.
Continue reading “How companies can navigate the age of machine learning”

Vehicle-to-vehicle communication networks can help fuel smart cities

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Bruno Fernandez-Ruiz on the importance of building the ground control center of the future.

In this episode of the Data Show, I spoke with Bruno Fernandez-Ruiz, co-founder and CTO of Nexar. We first met when he was leading Yahoo! technical teams charged with delivering a variety of large-scale, real-time data products. His new company is helping build out critical infrastructure for the emerging transportation sector.

While some question whether vehicle-to-everything (V2X) communication is necessary to get to fully autonomous vehicles, Nexar is already paving the way by demonstrating how a vehicle-to-vehicle (V2V) communication network can be built efficiently. As Fernandez-Ruiz points out, there are many applications for such a V2V network (safety being the most obvious one). I'm particularly fascinated by what such a network, and the accompanying data, opens up for the smarter cities of the future. As I pointed out in a post on continuous learning, simulations are an important component of training AI applications. It seems reasonable to expect that the data sets collected by V2V networks will be useful for smart city planners of the future.

Continue reading “Vehicle-to-vehicle communication networks can help fuel smart cities”

Transforming organizations through analytics centers of excellence

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Carme Artigas on helping enterprises transform themselves with big data tools and technologies.

In this episode of the Data Show, I spoke with Carme Artigas, co-founder and CEO of Synergic Partners (a Telefonica company). As more companies adopt big data technologies and techniques, it’s useful to remember that the end goal is to extract information and insight. In fact, as with any collection of tools and technologies, the main challenge is identifying and prioritizing use cases.

As Artigas describes, one can categorize use cases for big data into the following types:

  • Improve decision-making or operational efficiency
  • Generate new or additional revenue
  • Forecast or minimize risks (e.g., predicting or preventing fraud)

Artigas has spent many years helping large organizations develop best practices for how to use data and analytics. We discussed some of the key challenges faced by organizations that wish to adopt big data technologies, centers of excellence for analytics, and AI in the enterprise.
Continue reading “Transforming organizations through analytics centers of excellence”

The state of machine learning in Apache Spark

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark.

In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.

We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley; Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered on machine learning. Like many in the audience, I was first attracted to Spark because it allowed me to scale machine learning algorithms to large data sets while providing reasonable latency.

Here is a partial list of the items we discussed:

  • The current state of machine learning in Spark.
  • Given that a lot of innovation has taken place outside the Spark community (e.g., scikit-learn, TensorFlow, XGBoost), the role of Spark ML moving forward.
  • The plan to make it easier to integrate into Spark and Spark ML pipelines advanced analytics libraries that aren’t “textbook machine learning,” such as NLP, time series analysis, and graph analysis (see the sketch after this list).
  • Upcoming projects from Berkeley and Stanford that target AI applications, including newer systems that provide lower latency and higher throughput.
  • Recent Berkeley and Stanford projects that address two key bottlenecks in machine learning: the lack of training data, and deploying and monitoring models in production.
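On the pipeline-integration point, what makes this tractable is that a Spark ML Pipeline composes any stage implementing the Transformer interface. Here is a hedged sketch: the LowercaseText class is a hypothetical stand-in for a stage from a real analytics library, while Pipeline, Tokenizer, and HashingTF are standard pyspark.ml APIs.

```python
# Sketch: slotting a non-"textbook ML" step into a Spark ML Pipeline.
# LowercaseText is hypothetical, standing in for an NLP / time series /
# graph analysis stage; the other stages are standard pyspark.ml.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import Tokenizer, HashingTF

class LowercaseText(Transformer):
    """Hypothetical custom stage: lowercases the text column."""
    def _transform(self, df):
        return df.withColumn("text", F.lower(F.col("text")))

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Spark ML Pipelines Compose Stages",)], ["text"])

pipeline = Pipeline(stages=[
    LowercaseText(),                                    # custom analytics stage
    Tokenizer(inputCol="text", outputCol="words"),      # standard MLlib stage
    HashingTF(inputCol="words", outputCol="features"),  # standard MLlib stage
])
pipeline.fit(df).transform(df).select("features").show(truncate=False)
```

Any library stage that follows this contract, whether it tokenizes text or extracts graph features, composes with the rest of the pipeline and runs on Spark's distributed execution engine like any built-in stage.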

[Full disclosure: I am an advisor to Databricks.]

Effective mechanisms for searching the space of machine learning algorithms

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Kenneth Stanley on neuroevolution and other principled ways of exploring the world without an objective.

In this episode of the Data Show, I spoke with Ken Stanley, founding member of Uber AI Labs and associate professor at the University of Central Florida. Stanley is an AI researcher and a leading pioneer in the field of neuroevolution, a method for training neural networks with evolutionary algorithms. In a recent survey article, Stanley traced the history of neuroevolution and listed recent developments, including its applications to reinforcement learning problems.

Stanley is also the co-author of Why Greatness Cannot Be Planned: The Myth of the Objective, a book I’ve been recommending to anyone interested in innovation, public policy, and management. Inspired by Stanley’s research in neuroevolution (on topics like novelty search and open-endedness), the book is filled with examples of how notions first uncovered in the field of AI can be applied to many other disciplines and domains.

The book closes with a case study that hits closer to home: the current state of research in AI. One can think of machine learning and AI as a search for ever-better algorithms and models. Stanley points out that gatekeepers (editors of research journals, conference organizers, and others) impose two objectives that researchers must meet before their work gets accepted or disseminated: (1) empirical: their work should beat incumbent methods on some benchmark task; and (2) theoretical: proposed new algorithms are better if they can be proven to have desirable properties. Stanley argues that this means interesting work (“stepping stones”) that fails to meet either of these criteria falls by the wayside, preventing other researchers from building on potentially interesting but incomplete ideas.
Continue reading “Effective mechanisms for searching the space of machine learning algorithms”