Model Prebuilts

Overview

MLC-LLM is a universal solution for deploying different language models. Any model that can be described in TVM Relax (a general representation for neural networks, importable from models written in PyTorch) can be recognized by MLC-LLM and thus deployed to different backends with the help of TVM Unity.

There are two ways to run a model on MLC-LLM (this page focuses on the second one):

  1. Compile your own models following the model compilation page.

  2. Use off-the-shelf prebuilt models following this current page.

In order to run a specific model on MLC-LLM, you need:

1. A model library: a binary file containing the end-to-end functionality to run inference on a model (e.g. Llama-2-7b-chat-hf-q4f16_1-cuda.so). See the full list of all precompiled model libraries here.

2. Compiled weights: a folder containing multiple files that store the compiled and quantized weights of a model (e.g. https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC). See the full list of all precompiled weights here.
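To make the naming convention of these two artifacts concrete, the sketch below composes a library filename and a weights-repo URL from a model name, quantization scheme, and target platform. The helper functions are hypothetical illustrations of the pattern in the examples above, not part of mlc_llm.

```python
# Hypothetical helpers illustrating how MLC prebuilt artifacts are named.
# Only the pattern (model + quantization scheme + platform) follows the
# examples on this page; these functions are NOT part of mlc_llm.

def library_filename(model: str, quant: str, platform: str, suffix: str = "so") -> str:
    """Compose a model library filename, e.g.
    Llama-2-7b-chat-hf + q4f16_1 + cuda -> Llama-2-7b-chat-hf-q4f16_1-cuda.so"""
    return f"{model}-{quant}-{platform}.{suffix}"

def weights_repo(model: str, quant: str) -> str:
    """Compose the Hugging Face URL of the compiled-weights repo, e.g.
    -> https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC"""
    return f"https://huggingface.co/mlc-ai/{model}-{quant}-MLC"

print(library_filename("Llama-2-7b-chat-hf", "q4f16_1", "cuda"))
print(weights_repo("Llama-2-7b-chat-hf", "q4f16_1"))
```

For instance, the q4f16_1 CUDA build of Llama-2-7b-chat-hf yields exactly the filename and repo URL quoted in the list above.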

On this page, we first quickly go over how to use prebuilts on different platforms, then list the prebuilt models we currently provide.

Using Prebuilt Models for Different Platforms

We quickly go over how to use prebuilt models for each platform. You can find detailed instructions on each platform's corresponding page.

Prebuilt Models on CLI / Python

For more, please see the CLI page and the Python page.


First, create the conda environment if you have not already done so:

conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv
conda install git git-lfs
git lfs install

Download the prebuilt model libraries from GitHub:

mkdir dist/
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt_libs

Run the model with CLI:

mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

To run the model with Python API, see the Python page (all other downloading steps are the same as CLI).


Prebuilt Models on iOS

For more, please see the iOS page.


The iOS app has built-in support for RedPajama-3B and Mistral-7B-Instruct-v0.2.

All prebuilt models with an iOS entry in the model library table are supported on iOS. Namely, we have:

Prebuilt Models for iOS

| Model Code | Model Series | Quantization Mode | MLC HuggingFace Weights Repo |
|---|---|---|---|
| Mistral-7B-Instruct-v0.2-q3f16_1 | Mistral | int3 weight storage, float16 running data type, symmetric | link |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1 | RedPajama | int4 weight storage, float16 running data type, symmetric | link |
| phi-2-q4f16_1 | Microsoft Phi-2 | int4 weight storage, float16 running data type, symmetric | link |


Prebuilt Models on Android

For more, please see the Android page.


The APK for the demo Android app includes the following models. To add more, check out the Android page.

Prebuilt Models for Android

| Model Code | Model Series | Quantization Mode | Hugging Face Repo |
|---|---|---|---|
| Llama-2-7b-q4f16_1 | Llama | int4 weight storage, float16 running data type, symmetric | link |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1 | RedPajama | int4 weight storage, float16 running data type, symmetric | link |


Level 1: Supported Model Architectures (The All-In-One Table)

For each model architecture (e.g. Llama), there are multiple variants (e.g. CodeLlama, WizardLM). The variants share the same code for inference and only differ in their weights. In other words, running CodeLlama and WizardLM can use the same model library file (specified in Level 2 tables), but different precompiled weights (specified in Level 3 tables). Note that we have not provided prebuilt weights for all model variants.
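The sharing relationship above can be sketched as a lookup from variant to architecture. This is a minimal illustration, not an mlc_llm data structure; the ARCHITECTURE_OF entries are examples drawn from this page.

```python
# Illustrative only: variants of one architecture share a model library
# file but use their own compiled weights. Not an mlc_llm data structure.

ARCHITECTURE_OF = {
    "Llama-2": "llama",
    "CodeLlama": "llama",
    "WizardLM": "llama",
    "RedPajama-INCITE-Chat-3B-v1": "gpt_neox",
}

def shares_library(variant_a: str, variant_b: str) -> bool:
    """Two variants can reuse the same model library file iff their
    architectures match; their weight files always differ."""
    return ARCHITECTURE_OF[variant_a] == ARCHITECTURE_OF[variant_b]

print(shares_library("CodeLlama", "WizardLM"))  # True: both use the llama library
print(shares_library("CodeLlama", "RedPajama-INCITE-Chat-3B-v1"))  # False
```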

Each entry below hyperlinks to the corresponding level 2 and level 3 tables.

MLC-LLM supports the following model architectures:

Supported Model Architectures

  • LLaMA

  • Mistral

  • GPT-NeoX

  • GPTBigCode

  • Phi

  • GPT2

For each architecture, prebuilt weights are available for some variants but not others; see the Level 3 tables below for what is provided.

If the model variant you are interested in uses one of these supported model architectures (but we have not provided the prebuilt weights yet), you can check out Convert Weights via MLC for how to convert the weights. Afterwards, you may follow (Optional) 3. Upload weights to HF to upload your prebuilt weights to Hugging Face, and submit a PR that adds an entry to this page, contributing to the community.

For models structured in an architecture we have not yet supported, you can add support yourself by following the model compilation page.

Level 2: Model Library Tables (Precompiled Binary Files)

As mentioned earlier, each model architecture corresponds to a different model library file. That is, you cannot use the same model library file to run RedPajama and Llama-2. However, you can use the same Llama model library file to run Llama-2, WizardLM, CodeLlama, etc., just with different weight files (from the tables in Level 3).

Each table below lists the precompiled model library files for one model architecture. These are categorized by:

  • Size: each model size (e.g. 7B or 13B parameters) has its own distinct model library file

  • Platform: the backend the model library is intended to run on (e.g. CUDA, ROCm, iPhone, etc.)

  • Quantization scheme: the model library file also differs by the quantization scheme used (e.g. q3f16_1 vs. q4f16_1). For more on this, please see the quantization page.
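As a reading aid for scheme codes like q4f16_1, the sketch below splits a code into its parts: weight-storage bits, running float width, and an optional trailing scheme-variant number. The interpretation of the first two fields follows the quantization-mode cells in the tables on this page; the parser itself is illustrative and not part of mlc_llm.

```python
import re

# Illustrative parser for quantization scheme codes such as "q4f16_1":
#   qA -> weights stored as A-bit integers (q0 = weights not quantized),
#   fB -> computation runs in B-bit floats,
#   _N -> optional variant number of the scheme (e.g. q4f16_1 vs q4f16_0).
# Field meanings follow this page's tables; this is NOT an mlc_llm API.

def parse_quant(code: str) -> dict:
    m = re.fullmatch(r"q(\d+)f(\d+)(?:_(\d+))?", code)
    if not m:
        raise ValueError(f"unrecognized quantization code: {code}")
    return {
        "weight_bits": int(m.group(1)),
        "activation_float_bits": int(m.group(2)),
        "variant": int(m.group(3)) if m.group(3) else None,
    }

print(parse_quant("q4f16_1"))  # {'weight_bits': 4, 'activation_float_bits': 16, 'variant': 1}
print(parse_quant("q0f16"))    # q0f16 carries no variant suffix
```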

Each entry links to the specific model library file found in this GitHub repo.

If the model library you need is not available as a prebuilt, you can compile it yourself by following the model compilation page, and afterwards submit a PR to the binary-mlc-llm-libs repo.

Llama

Platforms: CUDA, ROCm, Vulkan (Linux), Vulkan (Windows), Metal (M Chip), Metal (Intel), iOS, Android, webgpu, mali. Availability varies per platform.

| Size | Quantization schemes with prebuilt libraries |
|---|---|
| 7B | q4f16_1, q4f32_1 |
| 13B | q4f16_1 |
| 34B | (none listed) |
| 70B | q4f16_1 |

Mistral

Platforms: CUDA, ROCm, Vulkan (Linux), Vulkan (Windows), Metal (M Chip), Metal (Intel), iOS, Android, webgpu, mali. Availability varies per platform.

| Size | Quantization schemes with prebuilt libraries |
|---|---|
| 7B | q4f16_1; q3f16_1 (iOS) |

GPT-NeoX (RedPajama-INCITE)

Platforms: CUDA, ROCm, Vulkan (Linux), Vulkan (Windows), Metal (M Chip), Metal (Intel), iOS, Android, webgpu, mali. Availability varies per platform.

| Size | Quantization schemes with prebuilt libraries |
|---|---|
| 3B | q4f16_1, q4f32_1 |

GPTBigCode

Platforms: CUDA, ROCm, Vulkan (Linux), Vulkan (Windows), Metal (M Chip), Metal (Intel), iOS, Android, webgpu, mali.

| Size | Quantization schemes with prebuilt libraries |
|---|---|
| 15B | (none listed) |

Phi

Platforms: CUDA, ROCm, Vulkan (Linux), Vulkan (Windows), Metal (M Chip), Metal (Intel), iOS, Android, webgpu, mali. Availability varies per platform.

| Size | Quantization schemes with prebuilt libraries |
|---|---|
| Phi-2 (2.7B) | q0f16, q4f16_1 |
| Phi-1.5 (1.3B) | q0f16, q4f16_1 |

GPT2

Platforms: CUDA, ROCm, Vulkan (Linux), Vulkan (Windows), Metal (M Chip), Metal (Intel), iOS, Android, webgpu, mali. Availability varies per platform.

| Size | Quantization schemes with prebuilt libraries |
|---|---|
| GPT2 (124M) | q0f16 |
| GPT2-med (355M) | q0f16 |

Level 3: Model Variant Tables (Precompiled Weights)

Finally, for each model variant, we provide the precompiled weights we uploaded to Hugging Face.

Each precompiled weight is categorized by its model size (e.g. 7B vs. 13B) and the quantization scheme (e.g. q3f16_1 vs. q4f16_1). We note that the weights are platform-agnostic.

Each model variant also loads its conversation configuration from a pre-defined conversation template. Note that multiple model variants can share a common conversation template.

Some of these files are uploaded by our community contributors. Thank you!
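The variant-to-template relationship described above can be sketched as a plain dictionary whose entries mirror the tables in this section; the dictionary itself is illustrative, not an mlc_llm data structure.

```python
# Illustrative lookup of the conversation template each model variant loads,
# mirroring the Level 3 tables on this page. Note that multiple variants may
# share one template. NOT an mlc_llm data structure.

CONV_TEMPLATE = {
    "Llama-2": "llama-2",
    "Mistral": "mistral_default",
    "NeuralHermes-2.5-Mistral": "neural_hermes_mistral",
    "OpenHermes-2-Mistral": "open_hermes_mistral",
    "WizardMath V1.1": "wizard_coder_or_math",
    "RedPajama": "redpajama_chat",
    "Phi": "phi-2",
    "GPT2": "gpt2",
}

print(CONV_TEMPLATE["Mistral"])  # mistral_default
```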

Llama-2

Conversation template: llama-2

Sizes with prebuilt weights on Hugging Face: 7B, 13B, 70B

Mistral

Conversation template: mistral_default

Sizes with prebuilt weights on Hugging Face: 7B

NeuralHermes-2.5-Mistral

Conversation template: neural_hermes_mistral

Sizes with prebuilt weights on Hugging Face: 7B

OpenHermes-2-Mistral

Conversation template: open_hermes_mistral

Sizes with prebuilt weights on Hugging Face: 7B

WizardMath V1.1

Conversation template: wizard_coder_or_math

Sizes with prebuilt weights on Hugging Face: 7B

RedPajama

Conversation template: redpajama_chat

Sizes with prebuilt weights on Hugging Face: 3B

Phi

Conversation template: phi-2

Models with prebuilt weights on Hugging Face: Phi-2 (2.7B), Phi-1.5 (1.3B)

GPT2

Conversation template: gpt2

Models with prebuilt weights on Hugging Face: GPT2 (124M), GPT2-medium (355M)


Contribute Models to MLC-LLM

Ready to contribute your compiled models or new model architectures? Awesome! Please check Contribute New Models to MLC-LLM for how to contribute new models to MLC-LLM.