How machine learning impacts information security

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Andrew Burt on the need to modernize data protection tools and strategies.

In this episode of the Data Show, I spoke with Andrew Burt, chief privacy officer and legal engineer at Immuta, a company building data management tools tuned for data science. Burt and cybersecurity pioneer Daniel Geer recently released a must-read white paper (“Flat Light”) that provides a great framework for how to think about information security in the age of big data and AI. They list important changes to the information landscape and offer suggestions on how to alleviate some of the new risks introduced by the rise of machine learning and AI.

We discussed their new white paper, cybersecurity (Burt was previously a special advisor at the FBI), and an exciting new Strata Data tutorial that Burt will be co-teaching in March.
Continue reading “How machine learning impacts information security”

7 data trends on our radar

[A version of this post appears on the O’Reilly Radar.]

From infrastructure to tools to training, here’s what’s ahead for data.

Whether you’re a business leader or a practitioner, here are key data trends to watch and explore in the months ahead.

Increasing focus on building data culture, organization, and training

In a recent O’Reilly survey, we found that the skills gap remains one of the key challenges holding back the adoption of machine learning. The demand for data skills (“the sexiest job of the 21st century”) hasn’t dissipated. LinkedIn recently found that demand for data scientists in the US is “off the charts,” and our survey indicated that the demand for data scientists and data engineers is strong not just in the US but globally.

With the average shelf life of a skill today at less than five years and the cost to replace an employee estimated at between six and nine months of the position’s salary, there is increasing pressure on tech leaders to retain and upskill rather than replace their employees in order to keep data projects (such as machine learning implementations) on track. We are also seeing more training programs aimed at executives and decision makers, who need to understand how these new ML technologies can impact their current operations and products.
Continue reading “7 data trends on our radar”

Deep automation in machine learning

[A version of this post appears on the O’Reilly Radar.]

We need to do more than automate model building with autoML; we need to automate tasks at every stage of the data pipeline.

By Ben Lorica and Mike Loukides

In a previous post, we talked about applications of machine learning (ML) to software development, which included a tour through sample tools in data science and for managing data infrastructure. Since that time, Andrej Karpathy has made some more predictions about the fate of software development: he envisions a Software 2.0, in which the nature of software development has fundamentally changed. Humans no longer implement code that solves business problems; instead, they define desired behaviors and train algorithms to solve their problems. As he writes, “a neural network is a better piece of code than anything you or I can come up with in a large fraction of valuable verticals.” We won’t be writing code to optimize scheduling in a manufacturing plant; we’ll be training ML algorithms to find optimum performance based on historical data.

If humans are no longer needed to write enterprise applications, what do we do? Humans are still needed to write software, but that software is of a different type. Developers of Software 1.0 have a large body of tools to choose from: IDEs, CI/CD tools, automated testing tools, and so on. The tools for Software 2.0 are only starting to exist; one big task over the next two years is developing the IDEs for machine learning, plus other tools for data management, pipeline management, data cleaning, data provenance, and data lineage.
Continue reading “Deep automation in machine learning”

Building tools for enterprise data science

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Vitaly Gordon on the rise of automation tools in data science.

In this episode of the Data Show, I spoke with Vitaly Gordon, VP of data science and engineering at Salesforce. As the use of machine learning becomes more widespread, we need tools that will allow data scientists to scale so they can tackle many more problems and help many more people. We need automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection and hyperparameter tuning, as well as monitoring.

I wanted the perspective of someone who is already faced with having to support many models in production. The proliferation of models is still a theoretical consideration for many data science teams, but Gordon and his colleagues at Salesforce already support hundreds of thousands of customers who need custom models built on custom data. They recently took their learnings public and open sourced TransmogrifAI, a library for automated machine learning for structured data, which sits on top of Apache Spark.

Here are some highlights from our conversation:
Continue reading “Building tools for enterprise data science”

Managing risk in machine learning

[A version of this post appears on the O’Reilly Radar.]

Considerations for a world where ML models are becoming mission critical.

In this post, I share slides and notes from a keynote I gave at the Strata Data Conference in New York last September. As the data community begins to deploy more machine learning (ML) models, I wanted to review some important considerations.

Let’s begin by looking at the state of adoption. We recently conducted a surveywhich garnered more than 11,000 respondents—our main goal was to ascertain how enterprises were using machine learning. One of the things we learned was that many companies are still in the early stages of deploying machine learning (ML):

As far as reasons for companies holding back, we found from a survey we conducted earlier this year that companies cited lack of skilled people, a “skills gap,” as the main challenge holding back adoption.

Interest on the part of companies means the demand side for “machine learning talent” is healthy. Developers have taken notice and are beginning to learn about ML. In our own online training platform (which has more than 2.1 million users), we’re finding strong interest in machine learning topics. Below are the top search topics on our training platform:
Continue reading “Managing risk in machine learning”

Lessons learned while helping enterprises adopt machine learning

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Francesca Lazzeri and Jaya Mathew on digital transformation, culture and organization, and the team data science process.

In this episode of the Data Show, I spoke with Francesca Lazzeri, an AI and machine learning scientist at Microsoft, and her colleague Jaya Mathew, a senior data scientist at Microsoft. We conducted a couple of surveys this year—“How Companies Are Putting AI to Work Through Deep Learning” and “The State of Machine Learning Adoption in the Enterprise” — and we found that while many companies are still in the early stages of machine learning adoption, there’s considerable interest in moving forward with projects in the near future. Lazzeri and Mathew spend a considerable amount of time interacting with companies that are beginning to use machine learning and have experiences that span many different industries and applications. I wanted to learn some of the processes and tools they use when they assist companies in beginning their machine learning journeys.

Here are some highlights from our conversation:

Team data science process

Francesca Lazzeri: The Data Science Process is a framework that we try to apply in our projects. Everything begins with a business problem, so external customers come to us with a business problem or a process they want to optimize. We work with them to translate these into realistic questions, into what we call data science questions. And then we move to the data portion: what are the different relevant data sources, is the data internal or external? After that, you try to define the data pipeline. We start with the core part of the data science process—that is, data cleaning—and proceed to feature engineering, model building, and model deployment and management.
Continue reading “Lessons learned while helping enterprises adopt machine learning”

Machine learning on encrypted data

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Alon Kaufman on the interplay between machine learning, encryption, and security.

In this episode of the Data Show, I spoke with Alon Kaufman, CEO and co-founder of Duality Technologies, a startup building tools that will allow companies to apply analytics and machine learning to encrypted data. In a recent talk, I described the importance of data, various methods for estimating the value of data, and emerging tools for incentivizing data sharing across organizations. As I noted, the main motivation for improving data liquidity is the growing importance of machine learning. We’re all familiar with the importance of data security and privacy, but probably not as many people are aware of the emerging set of tools at the intersection of machine learning and security. Kaufman and his stellar roster of co-founders are doing some of the most interesting work in this area.

Here are some highlights from our conversation:

Running machine learning models on encrypted data

Four or five years ago, techniques for running machine learning models on data while it’s encrypted were being discussed in the academic world. We did a few trials of this and although the results were fascinating, it still wasn’t practical.

… There have been big breakthroughs that have led to it becoming feasible. A few years ago, it was more theoretical. Now it’s becoming feasible. This is the right time to build a company. Not only because of the technology feasibility but definitely because of the need in the market.

Continue reading “Machine learning on encrypted data”