Issue #4: Scaling Machine Learning, Lakehouses, and Learning from Experiments

Recommendations:
Experimentation Works: The Surprising Power of Business Experiments. This is a book data scientists should give to their CxO. How do you build an experimentation organization, specifically one that has the pieces in place to run experiments at scale and at high velocity? Knowledge gained from every experiment, whether it fails or succeeds, should be archived. As Thomas Edison observed, “the most important part of an experimental laboratory is a big scrap heap.” Data science and ML teams are thankfully coming to the same conclusion! For example, MLflow’s new model registry, which records the settings of ML experiments, has quickly become an important component of this very popular open source project.

Scalable Machine Learning, Scalable Python, For Everyone
My Ray Summit co-chair and Head of Developer Relations at Anyscale, Dean Wampler, on his first impressions of Ray, and his journey from Scala to Python.

Episode 11: Dean Wampler

Subscribe: iTunes, Android, Spotify, Stitcher, Google, and RSS.

The state of privacy-preserving machine learning
Morten Dahl (the creator of TF Encrypted) provides a great overview of the tools and techniques for privacy-preserving ML, as well as the many challenges that remain.

Building domain specific natural language applications
Building on their work on Spark NLP, David Talby explains how his team built natural language models tuned specifically for healthcare applications.

What is a Lakehouse?
I wrote a post with some of the founders of Databricks, where we described a new data management paradigm for the age of machine learning and AI. A lakehouse gives you data versioning, governance, security and ACID properties that are needed even for unstructured data.

Machine Learning tools and infrastructure

  • To understand Ray’s potential, scan the variety of libraries already written on top of it: reinforcement learning (RLlib) and hyperparameter tuning (Tune) immediately come to mind. It’s also already being used for other distributed computing workloads. There are libraries for streaming (to be open sourced this year) and for model serving (Ray Serve).
  • Trax is a library for advanced deep learning built on JAX, actively used and maintained by the Google Brain team. Combined with the rise of PyTorch among researchers, the future of TensorFlow in the research community isn’t looking rosy.
  • Tools for training large DL models are in demand. Here are two new open source parameter servers from Chinese tech companies: BytePS from ByteDance and Angel-ML from Tencent.
  • The team behind Holoclean continues to make progress in building tools for data quality. Ihab Ilyas recently sketched out Inductiv for structured data, a scalable system for automatic error detection and repair.
  • I’m hearing more about two open source tools for BI: Redash and Apache Superset.

Work and hiring:

Conferences Roundup
Updates on upcoming SF Bay Area conferences I am co-chairing. Both of these events will be outstanding:

  • The Ray Summit (May 27-28 in San Francisco) preliminary schedule has just been announced! We have great keynotes and sessions, with more speakers to be confirmed shortly.
  • The Spark+AI Summit (June 22-25 in San Francisco) schedule will be announced in a few weeks. In the meantime, trainings and some outstanding keynotes are already on the web site.

