Applications of data science and machine learning in financial services

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jike Chong on the many exciting opportunities for data professionals in the U.S. and China.

In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.

We had a great conversation spanning many topics, including:

  • Potential applications of data science in financial services.
  • The current state of data science in financial services in both the U.S. and China.
  • His experience recruiting, training, and managing data science teams in both the U.S. and China.

Why companies are in need of data lineage solutions

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Neelesh Salian on data lineage, data governance, and evolving data platforms.

In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up.

There are several San Francisco Bay Area companies that have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it.

Here are some highlights from our conversation:

Data lineage

Data lineage is not something new. It’s something that is born out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I’m describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew what journey and exactly what constituted that data to come into being in your data warehouse or any other storage appliance you use, that would be really useful.

What data scientists and data engineers can do with current generation serverless technologies

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Avner Braverman on what’s missing from serverless today and what users should expect in the near future.

In this episode of the Data Show, I spoke with Avner Braverman, co-founder and CEO of Binaris, a startup that aims to bring serverless to web-scale and enterprise applications. This conversation took place shortly after the release of a seminal paper from UC Berkeley (“Cloud Programming Simplified: A Berkeley View on Serverless Computing”), and this paper seeded a lot of our conversation during this episode.

Serverless is clearly on the radar of data engineers and architects. In a recent survey, we found 85% of respondents already had parts of their data infrastructure in one of the public clouds, and 38% were already using at least one of the serverless offerings we listed. As more serverless offerings get rolled out—e.g., things like PyWren that target scientists—I expect these numbers to rise.

We had a great conversation spanning many topics, including:

  • A short history of cloud computing.
  • The fundamental differences between serverless and conventional cloud computing.
  • The reasons serverless—specifically AWS Lambda—took off so quickly.
  • What data scientists and data engineers can do with current-generation serverless offerings.
  • What is missing from serverless today and what users should expect in the near future.

Algorithms are shaping our lives – here’s how we wrest back control

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Kartik Hosanagar on the growing power and sophistication of algorithms.

In this episode of the Data Show, I spoke with Kartik Hosanagar, professor of technology and digital business, and professor of marketing at The Wharton School of the University of Pennsylvania. Hosanagar is also the author of a newly released book, A Human’s Guide to Machine Intelligence, an interesting tour through the recent evolution of AI applications, which draws from his extensive experience at the intersection of business and technology.

We had a great conversation spanning many topics, including:

  • The types of unanticipated consequences of which algorithm designers should be aware.
  • The predictability-resilience paradox: as systems become more intelligent and dynamic, they also become more unpredictable, so there are trade-offs algorithm designers must face.
  • Managing risk in machine learning: AI application designers need to weigh considerations such as fairness, security, privacy, explainability, safety, and reliability.
  • A bill of rights for humans impacted by the growing power and sophistication of algorithms.
  • Some best practices for bringing AI into the enterprise.

Why your attention is like a piece of contested territory

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: P.W. Singer on how social media has changed war, politics, and business.

In this episode of the Data Show, I spoke with P.W. Singer, strategist and senior fellow at the New America Foundation, and a contributing editor at Popular Science. He is co-author of an excellent new book, LikeWar: The Weaponization of Social Media, which explores how social media has changed war, politics, and business. The book is essential reading for anyone interested in how social media has become an important new battlefield in a diverse set of domains and settings.

We had a great conversation spanning many topics, including:

  • In light of the 10th anniversary of his earlier book Wired for War, we talked about progress in robotics over the past decade.
  • The challenge posed by the fact that social networks reward virality, not veracity.
  • How the internet has emerged as an important new battlefield.
  • How this new online battlefield changes how conflicts are fought and unfold.
  • How many of the ideas and techniques covered in LikeWar are trickling down from nation-state actors influencing global events, to consulting companies offering services that companies and individuals can use.

The technical, societal, and cultural challenges that come with the rise of fake media

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Siwei Lyu on machine learning for digital media forensics and image synthesis.

In this episode of the Data Show, I spoke with Siwei Lyu, associate professor of computer science at the University at Albany, State University of New York. Lyu is a leading expert in digital media forensics, a field of research into tools and techniques for analyzing the authenticity of media files. Over the past year, there have been many stories written about the rise of tools for creating fake media (mainly images, video, audio files). Researchers in digital image forensics haven’t exactly been standing still, though. As Lyu notes, advances in machine learning and deep learning have also found a receptive audience among the forensics community.

We had a great conversation spanning many topics, including:

  • The many indicators used by forensic experts and forgery detection systems.
  • Balancing “open” research with the risks that come with it, including “tipping off” adversaries.
  • State-of-the-art detection tools today, and what the research community and funding agencies are working on over the next few years.
  • Technical, societal, and cultural challenges that come with the rise of fake media.

Using machine learning and analytics to attract and retain employees

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Maryam Jahanshahi on building tools to help improve efficiency and fairness in how companies recruit.

In this episode of the Data Show, I spoke with Maryam Jahanshahi, research scientist at TapRecruit, a startup that uses machine learning and analytics to help companies recruit more effectively. In an upcoming survey, we found that a “skills gap” or “lack of skilled people” was one of the main bottlenecks holding back adoption of AI technologies. Many companies are exploring a variety of internal and external programs to train staff on new tools and processes. The other route is to hire new talent. But recent reports suggest that demand for data professionals is strong and competition for experienced talent is fierce. Jahanshahi and her team are building natural language and statistical tools that can help companies improve their ability to attract and retain talent across many key areas.

Here are some highlights from our conversation:

Optimal job titles

The conventional wisdom in our field has always been that you want to optimize for “the number of good candidates” divided by “the number of total candidates.” … The thinking is that one of the ways in which you get a good signal-to-noise ratio is if you advertise for a more senior role. … In fact, we found the number of qualified applicants was lower for the senior data scientist role.

… We saw from some of our behavioral experiments that people were feeling like that was too senior a role for them to apply to. What we would call the “confidence gap” was kicking in at that point. It’s a pretty well-known phenomenon that there are different groups of the population that are less confident. This has been best characterized in terms of gender. It’s the idea that most women only apply for jobs when they meet 100% of the qualifications, whereas most men will apply even with 60% of the qualifications. That was actually manifesting.
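
To make the ratio described above concrete, here is a minimal sketch in Python. The `Posting` class and all applicant counts are hypothetical, made up purely for illustration; they are not figures from the conversation or from TapRecruit. The sketch shows how a posting can score better on “good candidates divided by total candidates” while still attracting fewer qualified applicants in absolute terms.

```python
from dataclasses import dataclass


@dataclass
class Posting:
    """A job posting with hypothetical applicant counts (illustrative only)."""
    title: str
    total_applicants: int
    qualified_applicants: int

    @property
    def qualified_ratio(self) -> float:
        """Good candidates divided by total candidates."""
        if self.total_applicants == 0:
            return 0.0
        return self.qualified_applicants / self.total_applicants


# Made-up numbers: the senior title has the higher ratio,
# but the lower absolute number of qualified applicants.
postings = [
    Posting("Data Scientist", total_applicants=200, qualified_applicants=40),
    Posting("Senior Data Scientist", total_applicants=50, qualified_applicants=20),
]

for p in postings:
    print(f"{p.title}: ratio={p.qualified_ratio:.2f}, "
          f"qualified={p.qualified_applicants}")
```

Under these assumed numbers, optimizing for the ratio alone would favor the senior title, even though it yields half as many qualified applicants, which is the kind of gap the highlight describes.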

Highlighting benefits

We saw a lot of big companies that would offer 401(k), that would offer health insurance or family leave, but wouldn’t mention those benefits in the job descriptions. This had an impact on how candidates perceived these companies. Even though it’s implied that Coca-Cola is probably going to give you 401(k) and health insurance, not mentioning it changes the way you think of that job.