Gradient Flow #33: DataOps, Natural Language Benchmarks, Multimodal ML


This edition has 548 words which will take you about 3 minutes to read.

“While you are looking, you might as well also listen, linger and think about what you see.” – Jane Jacobs

Data Exchange podcast

How Technology Companies Are Using Ray Zhe Zhang is an Engineering Manager at Anyscale, where he leads the team that works on Ray and its ecosystem of libraries and partners. We discussed the Ray ecosystem and large-scale use cases at Ant Group, Uber, Amazon, and more.
Building a data store for unstructured data and deep learning applications The main bottleneck at most companies remains data, and fortunately there are many new startups in data infrastructure. I speak with Davit Buniatyan, founder and CEO of ActiveLoop, a startup building data management tools for the unstructured data types commonly associated with deep learning.

What is DataOps? Assaf Araki and I describe the tools, processes, and startups that are helping organizations deliver AI and data products and services quickly, reliably, and efficiently.
What Will it Take to Fix Benchmarking in Natural Language Understanding? Most NLU benchmarks are based on crowdsourced data and subjected to limited quality control. This position paper argues that NLU benchmarks are broken and proposes criteria for future benchmark datasets. I love this renewed focus on datasets and the recognition that investing in better benchmark datasets will lead to more robust and better-behaved models.
The deck used to raise a seed round in less than two weeks This post from the founders of Airbyte (an open source ELT tool) provides an update on data integration, a resurgent segment within the data infrastructure space.
Flashlight is a new open source machine learning library from Facebook, written entirely in C++.
An online resource allocation system based on Ray China’s Ant Group describes a platform it used for last year’s Double 11, the largest online shopping event in the world.

SambaNova Systems Raises $676M Series D
Bigeye raises $17M Series A Formerly known as Toro, this is a data quality startup founded by Uber alumni.

Multimodal Machine Learning Lectures from a very popular Carnegie Mellon course, on building models that utilize information and generate signals from multiple modalities (vision, speech, language, etc.).
Machine Learning with Graphs Videos from an ongoing Stanford course taught by Jure Leskovec. Graphs are used to describe entities with relations or interactions, and in many cases they can provide more accurate representations of your data. It stands to reason that if one can leverage the relational structure inherent in graphs, this will translate to more accurate machine learning models. To quote Jure: “Graphs are the new frontier of deep learning.”
Data for Better Lives This new World Bank report calls “for a new social contract that enables the use and reuse of data to create economic and social value, ensures equitable access to that value, and fosters trust that data will not be misused in harmful ways”.
Strategic Prediction: Transparency and Accuracy in Predictive Decision Making “When a measure becomes a target it ceases to be a good measure.” This recent ACM tutorial describes tools for machine learning developers who increasingly need to address a phenomenon familiar to economists (Goodhart’s Law) and social scientists (Campbell’s Law).
Technology Radar (Thoughtworks Advisory Board) Recommendations include tools and technologies that cut across many areas of interest to developers, managers, and CTOs.

Closing Short: This well-executed video essay arrives at a time when we are close to being able to travel again!
