Gradient Flow #45: Top Places to Work for Data Scientists; Model Serving; Tuning Language Models

“There’s no sense in being precise when you don’t even know what you’re talking about.” – John von Neumann

Data Exchange podcast

Deploying Machine Learning Models Safely and Systematically. Hamel Husain is a Staff Machine Learning Engineer at GitHub and a core developer for fastai.
Machine Learning in Astronomy and Physics. Dr. Viviana Acquaviva, Associate Professor at the CUNY Graduate Center, is an astrophysicist with a strong interest in data science and machine learning.
Large-scale machine learning and AI on multi-modal data. Bob Friday is VP and CTO at Mist Systems, a Juniper company. His team uses data, machine learning, and AI to “optimize user experiences and simplify operations across the wireless access, wired access, and SD-WAN domains.” They’ve deployed deep learning models for anomaly detection, as well as virtual assistants that provide insight and guidance to IT staff via a conversational interface.

Immediate 3X serving speed up with Ray Serve. Ray Serve is quietly becoming one of the more popular open source libraries for model serving. Learn how Wildlife Studios – one of the largest mobile gaming companies in the world – successfully deployed Ray Serve to deliver in-game offers.
cleanlab: machine learning with noisy labels. An open source library for confident learning, an approach that prunes noisy data (as opposed to fixing label errors) and ranks examples so that models can be trained with confidence.
Designing data ingestion pipelines. ML practitioners understand that scaling data ingestion pipelines is crucial, and that inefficiencies at this stage can cripple training throughput. Through the lens of deep learning for recommendation systems, a team from Facebook and Stanford presents an architecture for end-to-end training data ingestion.
Zingg. Companies today keep data in disparate systems. In this context, scalable entity resolution and master data management systems bring tremendous benefits to downstream analytic and machine learning applications. Zingg is a new open source library for large-scale entity resolution, built on top of Apache Spark.
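For readers curious about what "pruning noisy data" in the cleanlab item can look like, here is a minimal sketch of the idea behind confident learning. It is a simplified illustration written in plain Python, not cleanlab's API: the function name find_label_issues, the toy labels, and the toy probabilities are all assumptions made for this example. It assumes you already have out-of-sample predicted probabilities from some classifier.

```python
from collections import defaultdict


def find_label_issues(labels, pred_probs):
    """Return indices of likely mislabeled examples, most suspicious first.

    labels:     list of given (possibly noisy) integer class labels
    pred_probs: list of per-class predicted probabilities, one row per example
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the average probability the model assigns to
    # class k across the examples that were labeled k.
    sums, counts = defaultdict(float), defaultdict(int)
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    thresholds = [sums[k] / counts[k] if counts[k] else 1.0
                  for k in range(n_classes)]

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability clears that class's threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            guess = max(confident, key=lambda k: p[k])
            if guess != y:
                # Rank by how little probability the given label received.
                issues.append((p[y], i))

    issues.sort()
    return [i for _, i in issues]


# Toy data (hypothetical): examples 2 and 5 look mislabeled.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.1, 0.9], [0.7, 0.3]]
suspects = find_label_issues(labels, pred_probs)

# Prune the suspects rather than relabeling them, then retrain on the rest.
clean_indices = [i for i in range(len(labels)) if i not in set(suspects)]
```

The per-class thresholds are what make this more forgiving than a fixed 0.5 cutoff: a class the model is systematically unsure about gets a lower bar. The library itself does considerably more, so treat this only as a way to build intuition.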

A simple method to improve the zero-shot performance of large language models. This paper introduces tools developed by Google researchers, who show that instruction tuning improves a language model’s ability to perform tasks it has not seen during training.
Top Places to Work for Data Scientists. Here are three lists for different career stages: one for newly minted data scientists, one for experienced data scientists, and one for those in leadership roles.
The Road to Citizen Data Science. Video of my opening keynote at the first Citizen Data Science Summit at MIT.
Secure computation: Homomorphic encryption or hardware enclaves? A must-read overview of the tools available for collaborating with confidential information without sharing it.
Why Can’t I Find the Right Data? This post describes the information and metadata you must assemble to make your data discovery tools truly self-service.

If you enjoyed this newsletter, please support our work by encouraging your friends and colleagues to subscribe.