Model Prebuilts
Overview
MLC-LLM is a universal solution for deploying different language models. Any model that can be described in TVM Relax (a general representation for neural networks that can be imported from models written in PyTorch) can be recognized by MLC-LLM and thus deployed to different backends with the help of TVM Unity.
There are two ways to run a model on MLC-LLM (this page focuses on the second one):
Compile your own models following the model compilation page.
Use off-the-shelf prebuilt models, as described on this page.
In order to run a specific model on MLC-LLM, you need:
1. A model library: a binary file containing the end-to-end functionality to run inference on a model (e.g. Llama-2-7b-chat-hf-q4f16_1-cuda.so). See the full list of all precompiled model libraries here.
2. Compiled weights: a folder containing multiple files that store the compiled and quantized weights of a model (e.g. https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC). See the full list of all precompiled weights here.
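The two example names above appear to follow a common pattern: a model name, a quantization code, and then either a platform (for a model library) or MLC (for a weights repo). The sketch below splits a name along that pattern. The pattern is inferred from these two examples only, and parse_prebuilt_name and PrebuiltName are hypothetical helpers written for illustration, not part of MLC-LLM.

```python
from typing import NamedTuple


class PrebuiltName(NamedTuple):
    model: str         # e.g. "Llama-2-7b-chat-hf"
    quantization: str  # e.g. "q4f16_1"
    suffix: str        # platform such as "cuda" for a library, "MLC" for weights


def parse_prebuilt_name(name: str) -> PrebuiltName:
    """Split '<model>-<quantization>-<suffix>[.ext]' into its three parts.

    Assumption: the model part may contain hyphens, but the quantization
    code and the suffix do not, so we split from the right.
    """
    stem = name.rstrip("/").rsplit("/", 1)[-1]  # drop any URL or path prefix
    parts = stem.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected prebuilt name: {name!r}")
    model, quantization, last = parts
    suffix = last.split(".", 1)[0]  # strip a file extension such as ".so"
    return PrebuiltName(model, quantization, suffix)


lib = parse_prebuilt_name("Llama-2-7b-chat-hf-q4f16_1-cuda.so")
weights = parse_prebuilt_name(
    "https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC"
)
# A library and a weights repo belong together when model and quantization match.
compatible = (lib.model, lib.quantization) == (weights.model, weights.quantization)
```

Splitting from the right keeps model names that contain hyphens or dots (for example a version such as v0.2) intact.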
On this page, we first briefly go over how to use prebuilts on different platforms, then list the prebuilt models we currently provide.
Using Prebuilt Models for Different Platforms
We briefly go over how to use prebuilt models on each platform. You can find detailed instructions on each platform’s corresponding page.
Prebuilt Models on CLI / Python
For more, please see the CLI page and the Python page.
First create the conda environment if you have not done so.
conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv
conda install git git-lfs
git lfs install
Download the prebuilt model libraries from GitHub.
mkdir dist/
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt_libs
Run the model with CLI:
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
To run the model with Python API, see the Python page (all other downloading steps are the same as CLI).
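If you want to launch the CLI from a script, the command above can be assembled programmatically. This is only a minimal sketch: chat_command is a hypothetical helper, and the only thing taken from this page is the `mlc_llm chat HF://...` invocation itself. It does not use the MLC-LLM Python API, which is described on the Python page.

```python
import shlex
from typing import List

HF_PREFIX = "HF://"


def chat_command(repo: str) -> List[str]:
    """Build the argv list for `mlc_llm chat` on a Hugging Face weights repo.

    `repo` may be given with or without the HF:// prefix,
    e.g. "mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC".
    """
    if repo.startswith(HF_PREFIX):
        repo = repo[len(HF_PREFIX):]
    return ["mlc_llm", "chat", HF_PREFIX + repo]


cmd = chat_command("mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC")
command_line = shlex.join(cmd)  # the same command as a single shell string

# To actually start the chat (requires the mlc_llm executable on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when the command is run through subprocess.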
Prebuilt Models on iOS
For more, please see the iOS page.
The iOS app has built-in RedPajama-3B and Mistral-7B-Instruct-v0.2 support.
All prebuilt models with an entry in the iOS column of the model library table are supported by iOS. Namely, we have:
Model Code | Model Series | Quantization Mode | MLC HuggingFace Weights Repo |
---|---|---|---|
Mistral-7B-Instruct-v0.2-q3f16_1 | | | |
RedPajama-INCITE-Chat-3B-v1-q4f16_1 | | | |
phi-2-q4f16_1 | | | |
Prebuilt Models on Android
For more, please see the Android page.
The APK for the demo Android app includes the following models. To add more, check out the Android page.
Model code | Model Series | Quantization Mode | Hugging Face repo |
---|---|---|---|
Llama-2-7b-q4f16_1 | | | |
RedPajama-INCITE-Chat-3B-v1-q4f16_1 | | | |
Level 1: Supported Model Architectures (The All-In-One Table)
For each model architecture (e.g. Llama), there are multiple variants (e.g. CodeLlama, WizardLM). The variants share the same code for inference and only differ in their weights. In other words, running CodeLlama and WizardLM can use the same model library file (specified in Level 2 tables), but different precompiled weights (specified in Level 3 tables). Note that we have not provided prebuilt weights for all model variants.
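The relationship between variants and architectures can be pictured as a simple lookup. The sketch below uses only variant names mentioned on this page, and both ARCHITECTURE_OF and can_share_library are hypothetical names for illustration. It also ignores model size, platform, and quantization, which must match as well before two variants can really use the same library file.

```python
# Variant -> architecture, restricted to names that appear on this page.
ARCHITECTURE_OF = {
    "Llama-2": "Llama",
    "CodeLlama": "Llama",
    "WizardLM": "Llama",
    "RedPajama": "GPT-NeoX",
}


def can_share_library(variant_a: str, variant_b: str) -> bool:
    """True if both variants use the same architecture (and hence the same
    inference code), all other library settings being equal."""
    return ARCHITECTURE_OF[variant_a] == ARCHITECTURE_OF[variant_b]


same = can_share_library("CodeLlama", "WizardLM")     # both are Llama variants
different = can_share_library("RedPajama", "Llama-2")  # GPT-NeoX vs. Llama
```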
Each entry below hyperlinks to the corresponding level 2 and level 3 tables.
MLC-LLM supports the following model architectures:
Model Architecture | Support | Available MLC Prebuilts | Unavailable in MLC Prebuilts |
---|---|---|---|
If the model variant you are interested in uses one of the model architectures we support, but we have not provided prebuilt weights for it yet, you can check out Convert Weights via MLC on how to convert the weights. Afterwards, you may follow (Optional) 3. Upload weights to HF to upload your prebuilt weights to Hugging Face, and submit a PR that adds an entry to this page, contributing to the community.
For models structured in an architecture we have not supported yet, you could:
Either create a [Model Request] issue which automatically shows up on our Model Request Tracking Board.
Or follow our tutorial Define New Models, which introduces how to bring a new model architecture to MLC-LLM.
Level 2: Model Library Tables (Precompiled Binary Files)
As mentioned earlier, each model architecture corresponds to a different model library file. That is, you cannot use the same model library file to run RedPajama and Llama-2. However, you can use the same Llama model library file to run Llama-2, WizardLM, CodeLlama, etc., just with different weight files (from the tables in Level 3).
Each table below lists the precompiled model library files for one model architecture. The files are categorized by:
Size: each model size has its own distinct model library file (e.g. 7B or 13B parameters).
Platform: the backend that the model library is intended to run on (e.g. CUDA, ROCm, iPhone, etc.).
Quantization scheme: the model library file also differs by the quantization scheme used (e.g. q3f16_1 vs. q4f16_1). For more on this, please see the quantization page.
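Since a library is pinned down by model (which implies architecture and size), quantization scheme, and platform, looking one up in a local checkout amounts to matching those three parts. The sketch below searches a directory tree for such a file. It assumes the file naming seen in the Overview example (Llama-2-7b-chat-hf-q4f16_1-cuda.so) and makes no assumption about how the repo is organized into folders. find_library is a hypothetical helper, and the demo uses a temporary directory rather than a real clone.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_library(root: str, model: str, quantization: str, platform: str) -> Optional[Path]:
    """Return the first file under `root` named
    '<model>-<quantization>-<platform>.<ext>', or None if there is none."""
    pattern = f"{model}-{quantization}-{platform}.*"
    matches = sorted(Path(root).rglob(pattern))  # search all subfolders
    return matches[0] if matches else None


# Demo on a throwaway directory standing in for dist/prebuilt_libs.
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "Llama-2-7b-chat-hf"
    sub.mkdir()
    (sub / "Llama-2-7b-chat-hf-q4f16_1-cuda.so").touch()

    found = find_library(tmp, "Llama-2-7b-chat-hf", "q4f16_1", "cuda")
    # A different quantization scheme needs a different library file.
    missing = find_library(tmp, "Llama-2-7b-chat-hf", "q3f16_1", "cuda")
```

A None result corresponds to the case described next: the library is not available as a prebuilt and has to be compiled.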
Each entry links to the specific model library file found in this GitHub repo.
If the model library you need is not available as a prebuilt, you can compile it yourself by following the model compilation page, and submit a PR to the binary-mlc-llm-libs repo afterwards.
Llama
Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M Chip) | Metal (Intel) | iOS | Android | webgpu | mali |
---|---|---|---|---|---|---|---|---|---|---|
7B | | | | | | | | | | |
13B | | | | | | | | | | |
34B | | | | | | | | | | |
70B | | | | | | | | | | |
Mistral
Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M Chip) | Metal (Intel) | iOS | Android | webgpu | mali |
---|---|---|---|---|---|---|---|---|---|---|
7B | | | | | | | | | | |
GPT-NeoX (RedPajama-INCITE)
Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M Chip) | Metal (Intel) | iOS | Android | webgpu | mali |
---|---|---|---|---|---|---|---|---|---|---|
3B | | | | | | | | | | |
GPTBigCode
Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M Chip) | Metal (Intel) | iOS | Android | webgpu | mali |
---|---|---|---|---|---|---|---|---|---|---|
15B | | | | | | | | | | |
Phi
Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M Chip) | Metal (Intel) | iOS | Android | webgpu | mali |
---|---|---|---|---|---|---|---|---|---|---|
Phi-2 (2.7B) | | | | | | | | | | |
Phi-1.5 (1.3B) | | | | | | | | | | |
GPT2
Size | CUDA | ROCm | Vulkan (Linux) | Vulkan (Windows) | Metal (M Chip) | Metal (Intel) | iOS | Android | webgpu | mali |
---|---|---|---|---|---|---|---|---|---|---|
GPT2 (124M) | | | | | | | | | | |
GPT2-med (355M) | | | | | | | | | | |
Level 3: Model Variant Tables (Precompiled Weights)
Finally, for each model variant, we provide the precompiled weights that we uploaded to Hugging Face.
Each set of precompiled weights is categorized by its model size (e.g. 7B vs. 13B) and its quantization scheme (e.g. q3f16_1 vs. q4f16_1). Note that the weights are platform-agnostic.
Each model variant also loads its conversation configuration from a pre-defined conversation template. Note that multiple model variants can share a common conversation template.
Some of these files are uploaded by our community contributors–thank you!
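For quick reference, the conversation templates named in the variant sections below can be collected into one lookup. The mapping is copied from those section headings. CONV_TEMPLATES and variants_by_template are hypothetical names used only for this sketch and are not part of MLC-LLM.

```python
from collections import defaultdict
from typing import Dict, List

# Variant -> conversation template, as listed in the sections below.
CONV_TEMPLATES = {
    "Llama-2": "llama-2",
    "Mistral": "mistral_default",
    "NeuralHermes-2.5-Mistral": "neural_hermes_mistral",
    "OpenHermes-2-Mistral": "open_hermes_mistral",
    "WizardMath V1.1": "wizard_coder_or_math",
    "RedPajama": "redpajama_chat",
    "Phi": "phi-2",
    "GPT2": "gpt2",
}


def variants_by_template(templates: Dict[str, str]) -> Dict[str, List[str]]:
    """Group variants by the conversation template they load, which makes
    shared templates easy to spot."""
    groups = defaultdict(list)
    for variant, template in templates.items():
        groups[template].append(variant)
    return dict(groups)


groups = variants_by_template(CONV_TEMPLATES)
```

If another variant is later added with an existing template name, it shows up in the same group, reflecting that several variants can share one template.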
Llama-2
Conversation template: llama-2
Size | Hugging Face Repo Link |
---|---|
7B | |
13B | |
70B | |
Mistral
Conversation template: mistral_default
Size | Hugging Face Repo Link |
---|---|
7B | |
NeuralHermes-2.5-Mistral
Conversation template: neural_hermes_mistral
Size | Hugging Face Repo Link |
---|---|
7B | |
OpenHermes-2-Mistral
Conversation template: open_hermes_mistral
Size | Hugging Face Repo Link |
---|---|
7B | |
WizardMath V1.1
Conversation template: wizard_coder_or_math
Size | Hugging Face Repo Link |
---|---|
7B | |
RedPajama
Conversation template: redpajama_chat
Size | Hugging Face Repo Link |
---|---|
3B | |
Phi
Conversation template: phi-2
Size | Hugging Face Repo Link |
---|---|
Phi-2 (2.7B) | |
Phi-1.5 (1.3B) | |
GPT2
Conversation template: gpt2
Size | Hugging Face Repo Link |
---|---|
GPT2 (124M) | |
GPT2-medium (355M) | |
Contribute Models to MLC-LLM
Ready to contribute your compiled models/new model architectures? Awesome! Please check Contribute New Models to MLC-LLM on how to contribute new models to MLC-LLM.