Issue #4: Scaling Machine Learning, Lakehouses, and Learning from Experiments

This edition has 681 words, which will take you about 4 minutes to read.

Recommendations:
Experimentation Works: The Surprising Power of Business Experiments. This is a book data scientists should give to their CxO. How do you build an experimentation organization, specifically one that has the pieces in place to run experiments at scale and at high velocity? Knowledge gained from every experiment, whether it failed or succeeded, should be archived. As Thomas Edison observed, “the most important part of an experimental laboratory is a big scrap heap”. Data science and ML teams are thankfully coming to the same conclusion! For example, MLflow’s new model registry, which versions models and links each one back to the experiment run that produced it, has quickly become an important component of this very popular open source project.

Note that not everything can be subjected to an experiment, a point brought home by this 2003 satirical study on the benefits of using a parachute during free fall (also see this follow-up study from 2018).
A product manager on lessons learned from Good Experiments and Bad Experiments.
Drug Regulation in the Era of Individualized Therapies: ethical implications raised by a recent N of 1 trial.
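
To make the "scrap heap" idea from the book recommendation concrete, here is a minimal sketch of an experiment archive that keeps failures alongside successes. It assumes a plain JSON Lines file as the store. The function names (`archive_experiment`, `load_archive`) and the record fields are hypothetical, chosen for illustration. They are not part of MLflow or of any other tool mentioned in this issue.

```python
import json
import time
from pathlib import Path


def archive_experiment(log_path, name, settings, outcome, succeeded):
    """Append one experiment record to a JSON Lines file.

    Every experiment is written, including the ones that failed,
    so the archive doubles as the "scrap heap".
    """
    record = {
        "name": name,              # hypothetical field names throughout
        "settings": settings,      # the configuration that was tested
        "outcome": outcome,        # free-text summary of what was learned
        "succeeded": succeeded,    # False for failed experiments
        "archived_at": time.time(),
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_archive(log_path):
    """Return all archived records, oldest first (empty list if no file)."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `archive_experiment("experiments.jsonl", "new-checkout-flow", {"variant": "B"}, "no lift in conversion", False)` records a failed test. A later `load_archive("experiments.jsonl")` returns it together with the successes, so the team can check what has already been tried before running a similar experiment.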

Scalable Machine Learning, Scalable Python, For Everyone
My Ray Summit co-chair and Head of Developer Relations at Anyscale, Dean Wampler, on his first impressions of Ray, and his journey from Scala to Python.

Episode 11: Dean Wampler

The state of privacy-preserving machine learning
Morten Dahl (the creator of TF Encrypted) provides a great overview of the tools and techniques for privacy-preserving ML, as well as the many challenges that remain.

Building domain-specific natural language applications
David Talby explains how his team built on their Spark NLP work to create natural language models tuned specifically for healthcare applications.

What is a Lakehouse?
I wrote a post with some of the founders of Databricks, where we described a new data management paradigm for the age of machine learning and AI. A lakehouse gives you data versioning, governance, security and ACID properties that are needed even for unstructured data.

Machine Learning tools and infrastructure

To understand Ray’s potential, scan the variety of libraries already written on top of it: reinforcement learning (RLlib) and hyperparameter tuning (Tune) immediately come to mind. It is also being used for other distributed computing workloads: there are libraries for streaming (to be open sourced this year) and for model serving (Ray Serve).
Trax is a library for advanced deep learning built on JAX, actively used and maintained by the Google Brain team. Taken together with the rise of PyTorch among researchers, this suggests the future of TensorFlow in the research community isn’t looking rosy.
Tools for training large DL models are in demand. Here are two new open source parameter servers from Chinese tech companies: BytePS from ByteDance and Angel-ML from Tencent.
The team behind Holoclean continues to make progress in building tools for data quality. Ihab Ilyas recently sketched out Inductiv for structured data, a scalable system for automatic error detection and repair.
I’m hearing more about two open source tools for BI: Redash and Apache Superset.

Work and hiring:

Towards more precise job ads: Most job postings at Stripe describe “Projects you could work on”; some also mention “Who you’ll work with”. Here’s a sample job ad for an Infrastructure Engineer that contains both.
Measuring employee performance by surveillance: a disturbing overview of nontraditional monitoring tools.
The future of work is remote: SF Bay Area technology companies are learning how to build distributed teams.
Bi-annual reflection, instead of OKRs? Bryan Cantrill proposes a performance management methodology for engineers.

Conferences Roundup
Updates on upcoming SF Bay Area conferences I am co-chairing. Both of these events will be outstanding:

The Ray Summit (May 27-28 in San Francisco) preliminary schedule has just been announced! We have great keynotes and sessions, with more speakers to be confirmed shortly.
The Spark+AI Summit (June 22-25 in San Francisco) schedule will be announced in a few weeks. In the meantime, trainings and some outstanding keynotes are already on the web site.

Discover more from Gradient Flow