Data Exchange podcast
- Unleashing the power of large language models: If you work with text, you should incorporate transformer-based language models into your NLP pipelines. You can either build your own tools or use libraries that come with pre-trained models. Maarten Grootendorst is the author of open source libraries that I’ve come to love: BERTopic (topic modeling with transformers and c-TF-IDF), PolyFuzz (fuzzy string matching), and KeyBERT (keyword extraction). All of these libraries have simple Python APIs and are well documented, and BERTopic comes with several nice visualizations.
- Machine Learning for Time Series Intelligence: Aadyot Bhatnagar is a Senior Research Engineer at Salesforce and co-creator of Merlion, an open source framework for applying machine learning to time series data. Merlion supports a wide range of time series learning tasks, including forecasting, anomaly detection, and change point detection. I’ve long wanted (declarative) tools that make time series analysis and modeling more accessible to non-experts. New libraries like Nixtla, Merlion, Kats, and Greykite are steps in the right direction.
- Building production-ready machine learning pipelines: Hamza Tahir and Adam Probst are co-creators of ZenML, an extensible open source framework for building reproducible pipelines.
Ray AI Runtime (AIR): A scalable and unified toolkit for ML applications
Officially announced at this week’s Ray Summit, AIR unifies Ray’s existing native ML libraries to work smoothly together and integrate easily with popular ML frameworks. AIR makes it easy to run ML workloads in just a few lines of Python code, leaving Ray to coordinate computations at scale.
Confidential Computing and Machine Learning
We assess the popularity of various Confidential Computing tools and explain why Confidential Computing can now be used for analytics and machine learning (both model inference and model training).
Foundation Models: A non-technical primer
Kenn So of Shasta Ventures and I put together an overview of a class of models that have had an impact on computer vision, text, and speech applications. We list implications for product builders, entrepreneurs, and investors.
The best data warehouse is a lakehouse
A short summary of Databricks SQL (DBSQL) initiatives pertaining to classic data warehousing, data transformation & ingest, connectivity, and other items that are redefining analytics on the lakehouse.