Quick Start

Examples

To begin, try MLC LLM with the int4-quantized Llama 3 8B model. We recommend at least 6 GB of free VRAM to run it.
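The 6 GB figure is easier to understand with a back-of-the-envelope estimate of how much memory the weights alone take. The sketch below assumes roughly 8 billion parameters (taken from the model name) and counts only raw weight storage. It ignores quantization scales, the KV cache, activations, and runtime overhead, so the real footprint is higher. The helper name is hypothetical and is not part of MLC LLM.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Rough lower bound on weight storage in GiB (excludes scales, KV cache, overhead)."""
    bytes_total = num_params * bits_per_weight / 8  # bits -> bytes
    return bytes_total / 2**30                      # bytes -> GiB


if __name__ == "__main__":
    # ~8B parameters at 4 bits (int4) versus 16 bits (fp16), for comparison.
    print(f"int4: {estimate_weight_memory_gib(8e9, 4):.2f} GiB")   # int4: 3.73 GiB
    print(f"fp16: {estimate_weight_memory_gib(8e9, 16):.2f} GiB")  # fp16: 14.90 GiB
```

Under these assumptions, 4-bit weights need about 3.7 GiB, versus about 14.9 GiB at fp16. The remaining headroom in the recommended 6 GB goes to the KV cache and other runtime buffers.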

Install MLC LLM. MLC LLM is available via pip. We recommend installing it in an isolated conda virtual environment.
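Before running anything, you may want to confirm that the package is importable and that you are inside a dedicated conda environment. The sketch below relies on two assumptions. First, that `mlc_llm` is the import name, which matches the script in this guide. Second, that conda's `CONDA_DEFAULT_ENV` variable, which is set on activation, identifies the active environment. The function `check_environment` is hypothetical and does not belong to MLC LLM.

```python
import importlib.util
import os


def check_environment() -> dict:
    """Report whether mlc_llm is importable and whether a non-base conda env is active."""
    # conda sets CONDA_DEFAULT_ENV to the active environment's name when activated.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    return {
        # find_spec locates the top-level package without importing it.
        "mlc_llm_installed": importlib.util.find_spec("mlc_llm") is not None,
        "conda_env": conda_env,
        # Treat "base" (or no conda at all) as not isolated.
        "isolated_conda_env": conda_env not in (None, "", "base"),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```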

Run chat completion in Python. The following Python script showcases the Python API of MLC LLM:

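```python
from mlc_llm import MLCEngine

# Create the engine. The "HF://" prefix loads the prebuilt,
# int4-quantized (q4f16_1) weights from Hugging Face.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

try:
    # Run a streaming chat completion through the OpenAI-compatible API.
    for response in engine.chat.completions.create(
        messages=[{"role": "user", "content": "What is the meaning of life?"}],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            # Print each incremental delta as it arrives; a chunk may carry
            # no text, so fall back to an empty string.
            print(choice.delta.content or "", end="", flush=True)
    print("\n")
finally:
    # Release the engine's background resources even if generation fails.
    engine.terminate()
```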
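The streaming loop in the script above receives the reply in pieces through `choice.delta.content`. To work with the full reply instead of printing it, you can concatenate those pieces. The sketch below uses hypothetical stand-in chunk objects built with `SimpleNamespace`, shaped like the `response.choices[i].delta.content` access in the script. They are not MLC LLM types. The sketch also assumes a single choice per chunk (no `n > 1`).

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the incremental delta.content pieces of a single-choice streamed reply."""
    parts = []
    for chunk in chunks:
        for choice in chunk.choices:
            # Skip chunks that carry no text (e.g. a final chunk).
            if choice.delta.content:
                parts.append(choice.delta.content)
    return "".join(parts)


def fake_chunk(text):
    # Hypothetical stand-in for one streamed response object.
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


if __name__ == "__main__":
    chunks = [fake_chunk("The meaning"), fake_chunk(" of life"), fake_chunk(None)]
    print(collect_stream(chunks))  # The meaning of life
```

With a real engine you would pass the iterator returned by `engine.chat.completions.create(..., stream=True)` in place of the fake chunks.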

Documentation and tutorial. The Python API reference and its tutorials are available in the online documentation.

Figure: MLC LLM Python API (https://raw.githubusercontent.com/mlc-ai/web-data/main/images/mlc-llm/tutorials/python-engine-api.jpg)

What to Do Next