Gradient Flow #42: Data Quality; Oscilloscope for Deep Learning; Feature Stores


“The slow philosophy is not about doing everything in tortoise mode. It’s less about the speed and more about investing the right amount of time and attention in the problem so you solve it.” – Carl Honoré

Data Exchange podcast

An oscilloscope for deep learning Charles Martin is the founder of Calculation Consulting, a boutique consultancy focused on data science and machine learning. Along with Michael Mahoney and Serena Peng, Charles is co-author of a recent Nature paper on new methods for evaluating and tuning deep learning models.
Auditing machine learning models for discrimination, bias, and other risks I convene a panel of experts for an update on Responsible AI. Rayid Ghani is a Distinguished Career Professor in the Machine Learning Department and the Heinz College of Information Systems and Public Policy at Carnegie Mellon University, and Andrew Burt is co-founder and Managing Partner of BNH.ai, a new law firm focused on AI compliance, risk mitigation, and related topics.

[Image: Dana King’s *“Monumental Reckoning”*, by BL]

Data & Machine Learning tools and infrastructure

Data Quality Unpacked Kenn So (of Shasta Ventures) and I survey new solutions and startups, and list the key features to look for in a data quality solution.
How Ikigai Labs Serves Interactive AI Workflows at Scale using Ray Serve “Ray Serve can serve not only the various deep learning models, but also arbitrary Python code in a distributed manner. Since one of the biggest missions in the Ikigai data pipeline is to run user’s arbitrary Python code at scale with interactivity, Ray Serve provided answers to many challenges we faced as it enabled us to serve users’ code with real-time interaction.”
What feature stores are and how they are used today A short overview from a group of UC Berkeley PhD students. A recent VLDB tutorial by a team from Stanford, Apple, and Uber, also contains a good description of feature stores. The VLDB tutorial instructors predict that next generation feature stores will provide native support for embeddings (derived data in the form of low-dimensional, learned continuous vector representations). This would require tools for searching and querying embeddings as well as support for versioning, provenance, and downstream quality metrics.
Jurassic-1 Jumbo is the largest model ever made available to developers. J1-Jumbo is a 178B-parameter model, and J1-Large is a 7B-parameter model.
Hora is an approximate nearest neighbor search library written in Rust that comes with a Python API.
HashiCorp State of Cloud Strategy Survey “76% are already multi-cloud.” As I pointed out in a short post last year, surveys consistently show that a vast majority of respondents work at companies that use multiple clouds.
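
For readers wondering what "searching and querying embeddings" in the feature store item above looks like in practice, here is a minimal sketch in plain Python. It does an exact, brute-force cosine-similarity search over a tiny in-memory table. The table `EMBEDDINGS`, its keys, and the helper names are hypothetical and are not taken from the VLDB tutorial or from any feature store product. Approximate nearest neighbor libraries typically replace this exhaustive scan with an index so that lookups stay fast at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Return the k (key, score) pairs most similar to `query`, best first.

    This is an exact scan over every stored vector, not an approximate index.
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(key, score) for score, key in scored[:k]]


# Hypothetical toy embedding table, for illustration only.
EMBEDDINGS = {
    "user_a": [1.0, 0.0, 0.0],
    "user_b": [0.9, 0.1, 0.0],
    "user_c": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for key, score in top_k([1.0, 0.0, 0.0], EMBEDDINGS, k=2):
        print(key, round(score, 4))
```

A production system would also need to record which model version produced each vector, which is where the versioning and provenance requirements mentioned above come in.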

Figure 7: A representative sample of modern data quality solutions offered by startups and open source providers. Graphic: Gradient Flow.

Recommendations

Algorithms and Economic Justice: A Taxonomy of Harms and a Path Forward for the Federal Trade Commission This is a new whitepaper by FTC Commissioner Rebecca Kelly Slaughter. At a minimum, Section II (Algorithmic Harms) is worth reading. The FTC is the main US agency charged with protecting consumers and promoting competition.
Reinforcement Learning in data engineering and data management This survey paper collects RL use cases in Data System Optimizations (see Table 1), Data Analytics and Data Processing (see Table 4).
Designing Interactive Transfer Learning Tools for ML Non-Experts A 5-min video that accompanies the Best Paper from the recent ACM CHI conference. We really need better tools that will let non-experts repurpose and tweak existing, pretrained machine-learning models produced by the research community.
It’s Time to Retire the CSV file format
Five Commonly Used Idioms in the Tech Industry Read this so you’ll know what your tech/startup friends or colleagues are talking about.

Closing Short: Dance, Dance, Dance

Mesmerizing.. pic.twitter.com/J5eZZY9JVX

— Buitengebieden (@buitengebieden_) August 12, 2021

If you enjoyed this newsletter, please support our work by encouraging your friends and colleagues to subscribe.
