Applications of data science and machine learning in financial services

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jike Chong on the many exciting opportunities for data professionals in the U.S. and China.

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.

We had a great conversation spanning many topics, including:

  • Potential applications of data science in financial services.
  • The current state of data science in financial services in both the U.S. and China.
  • His experience recruiting, training, and managing data science teams in both the U.S. and China.

Continue reading “Applications of data science and machine learning in financial services”

Becoming a machine learning company means investing in foundational technologies

[A version of this post appears on the O’Reilly Radar.]

Companies successfully adopt machine learning either by building on existing data products and services, or by modernizing existing models and algorithms.

In this post, I share slides and notes from a keynote I gave at the Strata Data Conference in London earlier this year. I will highlight the results of a recent survey on machine learning adoption, and along the way describe recent trends in data and machine learning (ML) within companies. This is a good time to assess enterprise activities, as there are many indications that companies are already beginning to use machine learning. For example, in a July 2018 survey that drew more than 11,000 respondents, we found strong engagement among companies: 51% stated they already had machine learning models in production.

With all the hype around AI, it can be tempting to jump into use cases involving data types with which you aren’t familiar. We found that companies that have successfully adopted machine learning did so either by building on existing data products and services, or by modernizing existing models and algorithms. Here are some typical ways organizations begin using machine learning:

  • Build upon existing analytics use cases: e.g., take data sources already used for business intelligence and analytics, and use them in an ML application.
  • Modernize existing applications such as recommenders, search ranking, time series forecasting, etc.
  • Use ML to unlock new data types—e.g., images, audio, video.
  • Tackle completely new use cases and applications.

Continue reading “Becoming a machine learning company means investing in foundational technologies”

How AI and machine learning are improving customer experience

[A version of this post appears on the O’Reilly Radar.]

From data quality to personalization, to customer acquisition and retention, and beyond, AI and ML will shape the customer experience of the future.

By Ben Lorica and Mike Loukides.

What can artificial intelligence (AI) and machine learning (ML) do to improve customer experience? AI and ML have been intimately involved in online shopping since, well, the beginning of online shopping. You can’t use Amazon or any other shopping service without getting recommendations, which are often personalized based on the vendor’s understanding of your traits: your purchase history, your browsing history, and possibly much more. Amazon and other online businesses would love to invent a digital version of the (possibly mythical) salesperson who knows you and your tastes, and can unerringly guide you to products you will enjoy.

Everything begins with better data

To make that vision a reality, we need to start with some heavy lifting on the back end. Who are your customers? Do you really know who they are? All customers leave behind a data trail, but that data trail is a series of fragments, and it’s hard to relate those fragments to each other. If one customer has multiple accounts, can you tell? If a customer has separate accounts for business and personal use, can you link them? And if an organization uses many different names (we remember a presentation in which someone talked of the hundreds of names, literally, that resolved to IBM), can you discover the single organization responsible for them? Customer experience starts with knowing exactly who your customers are and how they’re related. Scrubbing your customer lists to eliminate duplicates is called entity resolution; it used to be the domain of large companies that could afford substantial data teams. We’re now seeing the democratization of entity resolution: startups provide entity resolution software and services that are appropriate for small to mid-sized organizations.

Once you’ve found out who your customers are, you have to ask how well you know them. Getting a holistic view of a customer’s activities is central to understanding their needs. What data do you have about them, and how do you use it? ML and AI are now being used as tools in data gathering: in processing the data streams that come from sensors, apps, and other sources. Gathering customer data can be intrusive and ethically questionable; as you build your understanding of your customers, make sure you have their consent and that you aren’t compromising their privacy.

Continue reading “How AI and machine learning are improving customer experience”

Real-time entity resolution made accessible

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jeff Jonas on the evolution of entity resolution technologies.

In this episode of the Data Show, I spoke with Jeff Jonas, CEO, founder and chief scientist of Senzing, a startup focused on making real-time entity resolution technologies broadly accessible. He was previously a fellow and chief scientist of context computing at IBM. Entity resolution (ER) refers to techniques and tools for identifying and linking manifestations of the same entity/object/individual. Ironically, ER itself has many different names (e.g., record linkage, duplicate detection, object consolidation/reconciliation, etc.).
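
To make the idea of linking manifestations of the same entity concrete, here is a minimal sketch in Python. It is not how Senzing or any production ER system works: it only groups records whose names match exactly after normalization, whereas real systems compare many attributes and use fuzzy matching. The suffix list, the alias table, and the sample records are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical legal suffixes to ignore when comparing organization names.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}

# Hypothetical alias table; a real system would curate or learn these.
ALIASES = {"ibm": "international business machines"}


def normalize(name):
    """Lowercase, strip punctuation and legal suffixes, expand known aliases."""
    text = name.lower().replace(".", "")
    text = re.sub(r"[^a-z0-9 ]", "", text)
    tokens = [t for t in text.split() if t not in LEGAL_SUFFIXES]
    key = " ".join(tokens)
    return ALIASES.get(key, key)


def resolve(records):
    """Group record ids that share the same normalized name."""
    groups = defaultdict(list)
    for record_id, name in records:
        groups[normalize(name)].append(record_id)
    return dict(groups)


# Made-up sample records: (record id, name as it appears in some source).
records = [
    (1, "I.B.M."),
    (2, "IBM Corp."),
    (3, "International Business Machines Corporation"),
    (4, "Acme, Inc."),
    (5, "ACME LLC"),
]

entities = resolve(records)
for canonical_name, record_ids in entities.items():
    print(canonical_name, record_ids)
```

Exact matching on a normalized key is the simplest possible approach; it misses typos and anything the alias table does not cover, which is part of why ER is hard in practice.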

ER is an essential first step in many domains, including marketing (cleaning up databases), law enforcement (background checks and counterterrorism), and financial services and investing. Knowing exactly who your customers are is an important task for security, fraud detection, marketing, and personalization. The proliferation of data sources and services has made ER very challenging in the internet age. In addition, many applications now increasingly require near real-time entity resolution.

We had a great conversation spanning many topics, including:

Continue reading “Real-time entity resolution made accessible”

Why companies are in need of data lineage solutions

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Neelesh Salian on data lineage, data governance, and evolving data platforms.

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up.

There are several San Francisco Bay Area companies that have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It’s something that is born out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I’m describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew what journey and exactly what constituted that data to come into being into your data warehouse or any other storage appliance you use, that would be really useful.
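
As an illustration of the "journey" idea (this is not from the interview, and it is not a description of the system built at Stitch Fix), here is a minimal Python sketch: every write to a dataset records which inputs and which transformation produced it, so the upstream journey can be reconstructed later. All class, dataset, and transformation names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    dataset: str          # dataset that was written
    inputs: list          # datasets it was derived from
    transformation: str   # name of the job or step that produced it
    written_at: str       # UTC timestamp of the write


class LineageLog:
    """In-memory lineage log; a real system would persist these records."""

    def __init__(self):
        self._records = {}

    def record_write(self, dataset, inputs, transformation):
        self._records[dataset] = LineageRecord(
            dataset=dataset,
            inputs=list(inputs),
            transformation=transformation,
            written_at=datetime.now(timezone.utc).isoformat(),
        )

    def upstream(self, dataset):
        """Return every dataset that contributed, directly or indirectly."""
        seen = set()
        record = self._records.get(dataset)
        stack = list(record.inputs) if record else []
        while stack:
            name = stack.pop()
            if name in seen:
                continue
            seen.add(name)
            parent = self._records.get(name)
            if parent:
                stack.extend(parent.inputs)
        return sorted(seen)


log = LineageLog()
log.record_write("clean_orders", ["raw_orders"], "drop_duplicates")
log.record_write(
    "recommendations", ["clean_orders", "client_profiles"], "train_and_score"
)
print(log.upstream("recommendations"))
```

Recording lineage at write time keeps the sketch simple; it assumes every job that writes data cooperates by calling `record_write`.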

Continue reading “Why companies are in need of data lineage solutions”

What data scientists and data engineers can do with current generation serverless technologies

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Avner Braverman on what’s missing from serverless today and what users should expect in the near future.

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.

Serverless is clearly on the radar of data engineers and architects. In a recent survey, we found 85% of respondents already had parts of their data infrastructure in one of the public clouds, and 38% were already using at least one of the serverless offerings we listed. As more serverless offerings get rolled out—e.g., things like PyWren that target scientists—I expect these numbers to rise.

We had a great conversation spanning many topics, including:

  • A short history of cloud computing.
  • The fundamental differences between serverless and conventional cloud computing.
  • The reasons serverless—specifically AWS Lambda—took off so quickly.
  • What data scientists and data engineers can do with current-generation serverless offerings.
  • What is missing from serverless today and what users should expect in the near future.

Specialized tools for machine learning development and model governance are becoming essential

[A version of this post appears on the O’Reilly Radar.]

Why companies are turning to specialized machine learning tools like MLflow.

By Ben Lorica and Mike Loukides.

A few years ago, we started publishing articles (see “Related resources” at the end of this post) on the challenges facing data teams as they start taking on more machine learning (ML) projects. Along the way, we described a new job role and title—machine learning engineer—focused on creating data products and making data science work in production, a role that was beginning to emerge in the San Francisco Bay Area two years ago. At that time, there weren’t any popular tools aimed at solving the problems facing teams tasked with putting machine learning into practice.

About 10 months ago, Databricks announced MLflow, a new open source project for managing machine learning development (full disclosure: Ben Lorica is an advisor to Databricks). We thought that given the lack of clear open source alternatives, MLflow had a decent chance of gaining traction, and this has proven to be the case. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow.

So, why is this new open source project resonating with data scientists and machine learning engineers? Recall the following key attributes of a machine learning project:

Continue reading “Specialized tools for machine learning development and model governance are becoming essential”