Projects

Explore our comprehensive suite of tools and libraries for high-performance machine learning compilation across diverse hardware platforms.

Core Projects

Featured

MLC LLM

High-performance, memory-efficient LLM inference across devices and backends. Built with advanced compilation and runtime optimizations for CPUs, GPUs, and mobile.

Universal hardware support
Advanced memory optimization
Dynamic batching support
Python & JavaScript APIs
View on GitHub · Documentation
Quick Start
pip install mlc-llm
# Load a model and chat through the OpenAI-compatible engine API
from mlc_llm import MLCEngine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello, world!"}], model=model)
print(response.choices[0].message.content)
engine.terminate()
Web

WebLLM

In-browser LLM inference on WebGPU with zero server dependency. Ship private, fast AI experiences that run entirely client-side.

WebGPU acceleration
On-device inference
Privacy-preserving
Offline capable
View on GitHub · Web Guide
Serving

FlexFlow Serve

Low-latency, high-performance LLM serving built on speculative inference. Tree-based speculative decoding and token tree verification significantly reduce end-to-end latency while preserving model quality; a conceptual sketch of the verification step follows below.

Low-latency LLM inference
Tree-based speculative decoding
Token tree verification
Co-serving PEFT & inference
View on GitHub · Serving Guide
[Diagram: clients send inference and PEFT requests to FlexFlow Serve, which pairs a compiler with a persistent kernel]
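
To make the tree-based approach concrete, here is a minimal conceptual sketch of token tree verification in Python. It is not FlexFlow Serve's implementation or API: the draft token tree, the sentinel root node, and the target_next_token callback are hypothetical stand-ins, and a real system scores every tree position in a single batched forward pass of the target model rather than one call per node; the loop below only shows the acceptance rule.

from dataclasses import dataclass, field

@dataclass
class TokenNode:
    token: int
    children: list["TokenNode"] = field(default_factory=list)

def verify_token_tree(root: TokenNode, target_next_token) -> list[int]:
    # root is a sentinel for the current context; its children are the first
    # speculated tokens. Keep draft tokens as long as they match what the
    # target model would have produced, then append one token from the target
    # model itself so every verification step makes progress.
    accepted, prefix, node = [], [], root
    while True:
        expected = target_next_token(prefix)
        child = next((c for c in node.children if c.token == expected), None)
        if child is None:
            accepted.append(expected)
            return accepted
        accepted.append(child.token)
        prefix, node = prefix + [child.token], child

# Draft tree proposes 5 then {7, 9}; the target agrees with 5 and 9, then adds 2.
tree = TokenNode(-1, [TokenNode(5, [TokenNode(7), TokenNode(9)])])
print(verify_token_tree(tree, lambda prefix: [5, 9, 2][len(prefix)]))  # [5, 9, 2]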
Optimization

Mirage

Automated kernel and graph optimization for LLM workloads, combining schedule search and code generation for maximum performance; a toy illustration of schedule search follows below.

Task auto-scheduling
MegaKernel generation
Hardware-aware optimization
View on GitHub · Mirage Docs
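
As a toy illustration of what schedule search means, the sketch below enumerates candidate tile sizes for a blocked matrix multiply, times each schedule, and keeps the fastest. This is only an analogy, not Mirage's API: Mirage searches over GPU kernel graphs and generates fused (mega)kernels rather than NumPy loops, and the tile sizes, problem size, and benchmarking loop here are arbitrary choices for the example.

import time
import numpy as np

def blocked_matmul(A, B, tile):
    # One candidate "schedule": accumulate the product over square tiles.
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

def bench(fn, reps=3):
    # Best-of-N wall-clock time for one schedule on this machine.
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

n = 512
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n), dtype=np.float32)
B = rng.standard_normal((n, n), dtype=np.float32)
times = {tile: bench(lambda: blocked_matmul(A, B, tile)) for tile in (16, 32, 64, 128)}
best_tile = min(times, key=times.get)
print(f"best tile: {best_tile} ({times[best_tile] * 1e3:.1f} ms)")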
Grammar

XGrammar

Constrained decoding with expressive grammars for structured generation. Produce JSON, SQL, and domain-specific formats reliably.

Deterministic outputs
Grammar-based control
Easy integration
View on GitHub · XGrammar Docs
Constrained decoding example: a JSON Schema requiring a "name" string field
{
  "type": "object",
  "properties": { "name": { "type": "string" } },
  "required": ["name"]
}
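
As a rough illustration of how grammar-constrained decoding guarantees well-formed output, the sketch below masks the candidate vocabulary at every step so that only tokens keeping the text a valid prefix of the grammar can be chosen. It is not XGrammar's API: the character-level vocabulary, the toy grammar of two JSON strings matching the schema above, and the random scores standing in for model logits are all invented for the example.

import random

VALID_OUTPUTS = ['{"name": "alice"}', '{"name": "bob"}']   # toy grammar
VOCAB = sorted({ch for s in VALID_OUTPUTS for ch in s})    # character-level "tokens"

def allowed_next(prefix: str) -> set[str]:
    # Tokens that keep the output a prefix of some string in the grammar.
    return {s[len(prefix)] for s in VALID_OUTPUTS
            if s.startswith(prefix) and len(s) > len(prefix)}

def generate() -> str:
    out = ""
    while out not in VALID_OUTPUTS:
        scores = {tok: random.random() for tok in VOCAB}   # stand-in for model logits
        mask = allowed_next(out)                           # grammar mask for this step
        out += max(mask, key=lambda tok: scores[tok])      # greedy pick among allowed tokens
    return out

print(generate())   # always a well-formed object with a "name" field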