Open Source Principles in Foundation Models

The launch of Mistral 7B prompted me to reflect on the concept of open source in relation to Large Language Models (LLMs). In essence, an open source LLM is a model whose code is publicly available under an open source license, allowing anyone to use, modify, and distribute it. Open source machine learning models are typically accompanied by comprehensive documentation covering the model’s architecture, training methodology, weights, parameters, and the datasets used for training and evaluation, enabling a deeper understanding of the model’s design and functionality.

The appropriate open source license is contingent upon your specific use case. Ideally, the LLM is distributed under one of several permissive licenses, such as the MIT License, Apache License 2.0, or the BSD 3-Clause License.

Assuming a permissive license, here are the key items that should accompany the release of an open source foundation model:
  1. Training Data: Detailed information about, or ideally access to, the datasets on which the model was trained, allowing for insights into the diversity, representativeness, and potential biases inherent in the data.
  2. Training Code: The program or script used to train a machine learning model. It includes the algorithm used, the loss function, the optimization method, and other details about how the model is trained on the training data (a minimal sketch follows this list).
  3. Model Architecture: This is the blueprint of a machine learning model, defining the layers, nodes, and connections between them. It determines how the model processes input data and makes predictions or classifications.
  4. Model Weights and Parameters: These are the learned values that a model uses to make predictions. They are crucial for utilizing the pre-trained model, as they determine how the model responds to new input (see the loading example after this list).
  5. Hyperparameters: These are the settings that control the learning process of a machine learning model. They are not part of the model itself, but they have a significant impact on the model’s performance. Some common hyperparameters include learning rate, batch size, and regularization strength.
  6. Preprocessing and Evaluation Code: The code used to prepare input data and evaluate model performance. These items ensure that users can correctly format their data and understand how well the model is likely to perform on new data.
  7. Documentation: Comprehensive and informative instructions on how to use the model, including the training process, dependencies, and requirements.
  8. Detailed Training Setup and Configuration: Information on the training setup that allows for accurate reproduction of the training environment.
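
To make items 2, 3, and 5 concrete, here is a minimal PyTorch sketch of what released training code might look like. It is purely illustrative: the toy architecture, the synthetic data, and the hyperparameter values are placeholders chosen for the example, not details from any actual foundation model release.

```python
import torch
import torch.nn as nn

# Hyperparameters (item 5): illustrative values, not from a real release.
LEARNING_RATE = 3e-4
BATCH_SIZE = 32
WEIGHT_DECAY = 0.01  # regularization strength
NUM_STEPS = 100

# Model architecture (item 3): a toy next-token predictor standing in
# for a real transformer definition.
VOCAB_SIZE, EMBED_DIM = 1000, 64
model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, EMBED_DIM),
    nn.Linear(EMBED_DIM, VOCAB_SIZE),
)

# Training code (item 2): loss function, optimizer, and training loop.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(
    model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY
)

for step in range(NUM_STEPS):
    # Synthetic batch; in a real release this would come from the
    # documented training data (item 1).
    inputs = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE,))
    targets = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE,))

    logits = model(inputs)  # shape: (BATCH_SIZE, VOCAB_SIZE)
    loss = loss_fn(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```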
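
Item 4 is the piece most users interact with directly. Here is a short sketch of loading released weights, assuming the checkpoint is hosted on the Hugging Face Hub and loaded with the transformers library; mistralai/Mistral-7B-v0.1 serves as the example, and the prompt text is my own.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Released weights and parameters (item 4), downloaded from the
# Hugging Face Hub; the full-precision files are roughly 14 GB.
checkpoint = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Preprocessing (item 6): the released tokenizer formats new inputs
# the same way the training data was formatted.
inputs = tokenizer("Open source foundation models are", return_tensors="pt")

# The learned weights determine how the model responds to new input.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```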

Open source LLMs have proven to be capable and viable options for a range of applications and tasks, and they offer additional benefits such as more control over deployment options and settings. However, the top-tier open source foundation models still demand significant resources to train, and the majority originate from a select group of companies, such as Meta. There’s a risk that future versions of these foundation models will come with restrictive licenses, so we definitely need a broader range of suppliers of open source foundation models.


If you enjoyed this post, please support our work by encouraging your friends and colleagues to subscribe to our newsletter.
