Gradient Flow #42: Data Quality; Oscilloscope for Deep Learning; Feature Stores


“The slow philosophy is not about doing everything in tortoise mode. It’s less about the speed and more about investing the right amount of time and attention in the problem so you solve it.” – Carl Honoré

Data Exchange podcast

[Image: Dana King’s “Monumental Reckoning”, by BL]

Data & Machine Learning tools and infrastructure

  • Data Quality Unpacked   Kenn So (of Shasta Ventures) and I survey new solutions and startups, and list key features to look for in a data quality solution.
  • How Ikigai Labs Serves Interactive AI Workflows at Scale using Ray Serve  “Ray Serve can serve not only the various deep learning models, but also arbitrary Python code in a distributed manner. Since one of the biggest missions in the Ikigai data pipeline is to run user’s arbitrary Python code at scale with interactivity, Ray Serve provided answers to many challenges we faced as it enabled us to serve users’ code with real-time interaction.”
  • What feature stores are and how they are used today    A short overview from a group of UC Berkeley PhD students. A recent VLDB tutorial by a team from Stanford, Apple, and Uber also contains a good description of feature stores. The VLDB tutorial instructors predict that next-generation feature stores will provide native support for embeddings (derived data in the form of low-dimensional, learned continuous vector representations). This would require tools for searching and querying embeddings, as well as support for versioning, provenance, and downstream quality metrics.
  • Jurassic-1 Jumbo is the largest model ever made available to developers   J1-Jumbo is a 178B-parameter model, and J1-Large is a 7B-parameter model.
  • Hora is an approximate nearest neighbor search library written in Rust that comes with a Python API.
  • HashiCorp State of Cloud Strategy Survey    “76% are already multi-cloud.”  As I pointed out in a short post last year, surveys consistently show that a vast majority of respondents work at companies that use multiple clouds.
Figure 7: A representative sample of modern data quality solutions offered by startups and open source providers. Graphic: Gradient Flow.
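
To make "searching and querying embeddings" from the feature store item concrete, here is a minimal sketch of an exact, brute-force nearest neighbor lookup using cosine similarity. It is not Hora's API or any feature store's API. Approximate nearest neighbor libraries aim to return similar results much faster on large collections, and this exact version is only the baseline they approximate. The names `cosine_similarity`, `top_k`, and the toy `embeddings` dictionary are hypothetical and chosen for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored embedding against the query and keep the k best.

    This scans the whole index, so the cost grows linearly with its size.
    """
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings keyed by entity id (made-up values).
embeddings = {
    "user_a": [1.0, 0.0, 0.0],
    "user_b": [0.9, 0.1, 0.0],
    "user_c": [0.0, 1.0, 0.0],
    "user_d": [0.0, 0.0, 1.0],
}

results = top_k([1.0, 0.05, 0.0], embeddings, k=2)
for key, score in results:
    print(key, round(score, 4))
```

A full scan like this is fine for a few thousand vectors. Beyond that, an index structure of the kind ANN libraries provide trades a small amount of accuracy for much lower query latency.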


Closing Short:  Dance, Dance, Dance

If you enjoyed this newsletter, please support our work by encouraging your friends and colleagues to subscribe.
