The current state of Apache Kafka

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Neha Narkhede on data integration, microservices, and Kafka’s roadmap.

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “The Age of Machine Learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics, including machine learning, remains an area of focus for most companies. It turns out, “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. Making it easier to create and productionize data refinement pipelines on both batch and streaming data sources lets analysts and data scientists focus on analytics that can unlock value from data.

On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.

Here are some highlights from our conversation:

The first engineering project that made use of Apache Kafka

If I remember correctly, we were putting Hadoop into place at LinkedIn for the first time, and I was on the team that was responsible for that. The problem was that all our scripts were actually built for another data warehousing solution. The question was, are we going to rewrite all of those scripts and now sort of make them Hadoop specific? And what happens when a third and a fourth and a fifth system is put into place?

So, the initial motivating use case was: ‘we are putting this Hadoop thing into place. That’s the new-age data warehousing solution. It needs access to the same data that is coming from all our applications. So, that is the thing we need to put into practice.’ This became Kafka’s very first use case at LinkedIn. From there, because that was very easy and I actually helped move one of the very first workloads to Kafka, it was hardly difficult to convince the rest of the LinkedIn engineering team to start moving over to Kafka.
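To make the integration problem concrete, here is a minimal sketch of the idea that several downstream systems can read the same stream of application data independently. It is a toy in-memory log written for illustration, not Kafka's API; the `AppendOnlyLog` class, the consumer names, and the sample events are all assumptions.

```python
from collections import defaultdict


class AppendOnlyLog:
    """Toy in-memory log: producers append, each consumer tracks its own offset."""

    def __init__(self):
        self._records = []                # ordered, never rewritten
        self._offsets = defaultdict(int)  # consumer name -> next index to read

    def append(self, record):
        self._records.append(record)

    def poll(self, consumer):
        """Return records this consumer has not seen yet and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


log = AppendOnlyLog()
log.append({"event": "page_view", "member": 1})
log.append({"event": "search", "member": 2})

warehouse_batch = log.poll("warehouse")  # the existing system
hadoop_batch = log.poll("hadoop")        # a newly added system reads the same data
log.append({"event": "page_view", "member": 3})
hadoop_next = log.poll("hadoop")         # only the record added since its last poll
```

Because each consumer keeps its own position, adding a third or fourth system in this sketch means choosing a new consumer name rather than rewriting the producers.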

Building a natural language processing library for Apache Spark

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: David Talby on a new NLP library for Spark, and why model development starts after a model gets deployed to production.

When I first discovered and started using Apache Spark, a majority of the use cases I used it for involved unstructured text. The absence of libraries meant rolling my own NLP utilities, and, in many cases, implementing a machine learning library (this was before deep learning, and MLlib was much smaller). I’d always wondered why no one bothered to create an NLP library for Spark when many people were using Spark to process large amounts of text. The recent, early success of BigDL confirms that users like the option of having native libraries.

In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago I mentioned the need for an NLP library within Spark to Talby; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception received by BigDL and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will be a standard tool among Spark users.

Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.

Here are some highlights from our conversation:

The state of NLP in Spark

Here are your two choices today. Either you want to leverage all of the performance and optimization that Spark gives you, which means you want to stay basically within the JVM, and you want to use a Java-based library. In which case, you have options that include OpenNLP, which is open source, or Stanford NLP, which requires licensing in order to use in a commercial product. These are older and more academically oriented libraries. So, they have limitations in performance and what they do.

How companies can navigate the age of machine learning

[A version of this post appears on the O’Reilly Radar.]

To become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.

Over the last few years, the data community has focused on gathering and collecting data, building infrastructure for that purpose, and using data to improve decision-making. We are now seeing a surge in interest in advanced analytics and machine learning across many industry verticals.

In this post, I share slides and notes from a talk I gave this past September at Strata Data NYC offering suggestions to companies interested in adding machine learning capabilities. The information stems from conversations with practitioners, researchers, and entrepreneurs at the forefront of applying machine learning across many different problem domains.

How Ray makes continuous learning accessible and easy to scale

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Robert Nishihara and Philipp Moritz on a new framework for reinforcement learning and AI applications.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Robert Nishihara and Philipp Moritz, graduate students at UC Berkeley and members of RISE Lab. I wanted to get an update on Ray, an open source distributed execution framework that makes it easy for machine learning engineers and data scientists to scale reinforcement learning and other related continuous learning algorithms. Many AI applications involve an agent (for example a robot or a self-driving car) interacting with an environment. In such a scenario, an agent will need to continuously learn the right course of action to take for a specific state of the environment.

What do you need in order to build large-scale continuous learning applications? You need a framework with low-latency response times, one that can run massive numbers of simulations quickly (agents need to be able to explore states within an environment) and that supports heterogeneous computation graphs. Ray is a new execution framework written in C++ that contains these key ingredients. In addition, Ray is accessible via Python (and Jupyter Notebooks), and comes with many of the standard reinforcement learning and related continuous learning algorithms that users can easily call.

As Nishihara and Moritz point out, frameworks like Ray are also useful for common applications such as dialog systems, text mining, and machine translation. Here are some highlights from our conversation:

Tools for reinforcement learning

Ray is something we’ve been building that’s motivated by our own research in machine learning and reinforcement learning. If you look at what researchers who are interested in reinforcement learning are doing, they’re largely ignoring the existing systems out there and building their own custom frameworks or custom systems for every new application that they work on.

… For reinforcement learning, you need to be able to share data very efficiently, without copying it between multiple processes on the same machine; you need to be able to avoid expensive serialization and deserialization; and you need to be able to create a task and get the result back in milliseconds instead of hundreds of milliseconds. So, there are a lot of little details that come up.
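For readers who want a feel for the "create a task and get the result back" pattern, here is a small sketch using only Python's standard library. It is not Ray and makes no claim about Ray's API or performance; the `simulate` function is a hypothetical stand-in for one cheap simulation step, and the thread pool is just a convenient way to submit many small tasks and measure the average round trip.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulate(state):
    """Hypothetical stand-in for one cheap simulation step."""
    return state * 2


with ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    # Submit many small tasks, then collect each result from its future.
    futures = [pool.submit(simulate, s) for s in range(1000)]
    results = [f.result() for f in futures]
    elapsed_ms = (time.perf_counter() - start) * 1000

per_task_ms = elapsed_ms / len(results)
print(f"{len(results)} tasks, {per_task_ms:.3f} ms per task on average")
```

When each task does very little work, as here, the scheduling overhead dominates the measured time, which is why per-task latency matters so much for workloads built from huge numbers of short simulations.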

A scalable time-series database that supports SQL

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Michael Freedman on TimescaleDB and scaling SQL for time-series.

In this episode of the Data Show, I spoke with Michael Freedman, CTO of Timescale and professor of computer science at Princeton University. When I first heard that Freedman and his collaborators were building a time-series database, my immediate reaction was: “Don’t we have enough options already?” The early incarnation of Timescale was a startup focused on IoT, and it was while building tools for the IoT problem space that Freedman and the rest of the Timescale team came to realize that the database they needed wasn’t available (at least out in open source). Specifically, they wanted a database that could easily support complex queries and the sort of real-time applications many have come to associate with streaming platforms. Based on early reactions to TimescaleDB, many users concur.
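As a rough illustration of why plain SQL is attractive for time-series work, the sketch below groups sensor readings into one-minute buckets and averages them per device. It uses Python's built-in `sqlite3` module as a stand-in, not TimescaleDB; the `readings` table, its columns, and the sample values are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, device TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (0, "a", 20.0),   # ts is seconds since an arbitrary start
        (30, "a", 22.0),
        (60, "a", 24.0),
        (90, "b", 18.0),
    ],
)

# Integer division truncates each timestamp down to the start of its minute.
rows = conn.execute(
    """
    SELECT (ts / 60) * 60 AS bucket, device, AVG(temp) AS avg_temp
    FROM readings
    GROUP BY bucket, device
    ORDER BY bucket, device
    """
).fetchall()

for bucket, device, avg_temp in rows:
    print(bucket, device, avg_temp)
```

The same bucketing and aggregation could be combined with joins or filters against other tables, which is the kind of complex query that is awkward to express without SQL.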

What are machine learning engineers?

[A version of this appears on the O’Reilly Radar.]

A new role focused on creating data products and making data science work in production.

by Ben Lorica and Mike Loukides

We’ve been talking about data science and data scientists for a decade now. While there’s always been some debate over what “data scientist” means, we’ve reached the point where many universities, online academies, and bootcamps offer data science programs: master’s degrees, certifications, you name it. The world was a simpler place when we only had statistics. But simplicity isn’t always healthy, and the diversity of data science programs demonstrates nothing if not the demand for data scientists.

As the field of data science has developed, any number of poorly distinguished specialties have emerged. Companies use the terms “data scientist” and “data science team” to describe a variety of roles, including:

  • individuals who carry out ad hoc analysis and reporting (including BI and business analytics)
  • people who are responsible for statistical analysis and modeling, which, in many cases, involves formal experiments and tests
  • machine learning modelers who increasingly develop prototypes using notebooks

And that listing doesn’t include the people DJ Patil and Jeff Hammerbacher were thinking of when they coined the term “data scientist”: the people who are building products from data. These data scientists are most similar to the machine learning modelers, except that they’re building something: they’re product-centric, rather than researchers. They typically work across large portions of data products. Whatever the role, data scientists aren’t just statisticians; they frequently have doctorates in the sciences, with a lot of practical experience working with data at scale. They are almost always strong programmers, not just specialists in R or some other statistical package. They understand data ingestion, data cleaning, prototyping, bringing prototypes to production, product design, setting up and managing data infrastructure, and much more. In practice, they turn out to be the archetypal Silicon Valley “unicorns”: rare and very hard to hire.

What’s important isn’t that we have well-defined specialties; in a thriving field, there will always be huge gray areas. What made “data science” so powerful was the realization that there was more to data than actuarial statistics, business intelligence, and data warehousing. Breaking down the silos that separated data people from the rest of the organization—software development, marketing, management, HR—is what made data science distinct. Its core concept was that data was applicable to everything. The data scientist’s mandate was to gather, and put to use, all the data. No department went untouched.

Architecting and building end-to-end streaming applications

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Karthik Ramasamy on Heron, DistributedLog, and designing real-time applications.

In this episode of the Data Show, I spoke with Karthik Ramasamy, adjunct faculty member at UC Berkeley, former engineering manager at Twitter, and co-founder of Streamlio. Ramasamy managed the team that built Heron, an open source, distributed stream processing engine, compatible with Apache Storm. While Ramasamy has seen firsthand what it takes to build and deploy large-scale distributed systems (within Twitter, he worked closely with the team that built DistributedLog), he is first and foremost interested in designing and building end-to-end applications. As someone who organizes many conferences, I’m all too familiar with the vast array of popular big data frameworks available. But, I also know that engineers and architects are most interested in content and material that helps them cut through the options and decisions.

Ramasamy and I discussed the importance of designing systems that can be combined to produce end-to-end applications with the requisite characteristics and guarantees.

Here are some highlights from our conversation:

Moving from Apache Storm to Heron

A major consideration was that we had to fundamentally change a lot of things. So, the team weighed the cost: should we go with an existing code base or develop a new code base? We thought that even if we developed a new code base, we would be able to get it done very quickly and the team was excited about it. That’s what we did and we got the first version of Heron done in eight or nine months.

I think it was one of the quickest transitions that ever happened in the history of Twitter. Apache Storm was hit by a lot of failures. There was a strong incentive to move to a new system. Once we proved the new system was highly reliable, we created a compelling value for the engineering teams. We also made it very painless for people to move. All they had to do was recompile a job and launch it. So, when you make a system like that, then people are just going to say, ‘let me give it a shot.’ They just compile it, launch it, then they say, ‘for a week, my job has been running without any issues; that’s good, I’m moving.’ So, we got migration done, from Storm to Heron, in less than six months. All the teams cooperated with us, and it was just amazing that we were able to get it done in less than six months. And we provided them a level of reliability that they never had with Storm.
