Projects

Explore our comprehensive suite of tools and libraries for high-performance machine learning compilation across diverse hardware platforms.

Core Projects

Featured

MLC LLM

High-performance, memory-efficient LLM inference across devices and backends. Built with advanced compilation and runtime optimizations for CPUs, GPUs, and mobile.

Universal hardware support
Advanced memory optimization
Dynamic batching support
Python & JavaScript APIs
View on GitHub · Documentation
Quick Start
pip install mlc-llm
# Load a model and chat through the OpenAI-compatible engine API
from mlc_llm import MLCEngine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello, world!"}], model=model)
print(response.choices[0].message.content)
engine.terminate()
Web

WebLLM

In-browser LLM inference on WebGPU with zero server dependency. Ship private, fast AI experiences that run entirely client-side.

WebGPU acceleration
On-device inference
Privacy-preserving
Offline capable
View on GitHub · Web Guide
Serving

FlexFlow Serve

Low-latency, high-performance LLM serving built on speculative inference. Tree-based speculative decoding and token tree verification significantly reduce end-to-end latency while preserving model quality; a conceptual sketch of the verification step follows below.

Low-latency LLM inference
Tree-based speculative decoding
Token tree verification
Co-serving PEFT & inference
View on GitHub · Serving Guide
[Diagram: clients send inference and PEFT requests to FlexFlow Serve, which pairs a compiler with a persistent kernel]
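
To make the tree-based approach concrete, here is a minimal conceptual sketch of token tree verification in Python. It is not FlexFlow Serve's implementation or API: the draft token tree, the sentinel root node, and the target_next_token callback are hypothetical stand-ins, and a real system scores every tree position in a single batched forward pass of the target model rather than one call per node; the loop below only shows the acceptance rule.

from dataclasses import dataclass, field

@dataclass
class TokenNode:
    token: int
    children: list["TokenNode"] = field(default_factory=list)

def verify_token_tree(root: TokenNode, target_next_token) -> list[int]:
    # root is a sentinel for the current context; its children are the first
    # speculated tokens. Keep draft tokens as long as they match what the
    # target model would have produced, then append one token from the target
    # model itself so every verification step makes progress.
    accepted, prefix, node = [], [], root
    while True:
        expected = target_next_token(prefix)
        child = next((c for c in node.children if c.token == expected), None)
        if child is None:
            accepted.append(expected)
            return accepted
        accepted.append(child.token)
        prefix, node = prefix + [child.token], child

# Draft tree proposes 5 then {7, 9}; the target agrees with 5 and 9, then adds 2.
tree = TokenNode(-1, [TokenNode(5, [TokenNode(7), TokenNode(9)])])
print(verify_token_tree(tree, lambda prefix: [5, 9, 2][len(prefix)]))  # [5, 9, 2]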
Optimization

Mirage

Automated kernel and graph optimization for LLM workloads, combining schedule search and code generation for maximum performance; a toy illustration of schedule search follows below.

Task auto-scheduling
MegaKernel generation
Hardware-aware optimization
View on GitHub · Mirage Docs
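
As a toy illustration of what schedule search means, the sketch below enumerates candidate tile sizes for a blocked matrix multiply, times each schedule, and keeps the fastest. This is only an analogy, not Mirage's API: Mirage searches over GPU kernel graphs and generates fused (mega)kernels rather than NumPy loops, and the tile sizes, problem size, and benchmarking loop here are arbitrary choices for the example.

import time
import numpy as np

def blocked_matmul(A, B, tile):
    # One candidate "schedule": accumulate the product over square tiles.
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

def bench(fn, reps=3):
    # Best-of-N wall-clock time for one schedule on this machine.
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

n = 512
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n), dtype=np.float32)
B = rng.standard_normal((n, n), dtype=np.float32)
times = {tile: bench(lambda: blocked_matmul(A, B, tile)) for tile in (16, 32, 64, 128)}
best_tile = min(times, key=times.get)
print(f"best tile: {best_tile} ({times[best_tile] * 1e3:.1f} ms)")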
Grammar

XGrammar

Constrained decoding with expressive grammars for structured generation. Produce JSON, SQL, and domain-specific formats reliably.

Deterministic outputs
Grammar-based control
Easy integration
View on GitHub · XGrammar Docs
Constrained decoding example: a JSON Schema requiring a "name" string field
{
  "type": "object",
  "properties": { "name": { "type": "string" } },
  "required": ["name"]
}
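
As a rough illustration of how grammar-constrained decoding guarantees well-formed output, the sketch below masks the candidate vocabulary at every step so that only tokens keeping the text a valid prefix of the grammar can be chosen. It is not XGrammar's API: the character-level vocabulary, the toy grammar of two JSON strings matching the schema above, and the random scores standing in for model logits are all invented for the example.

import random

VALID_OUTPUTS = ['{"name": "alice"}', '{"name": "bob"}']   # toy grammar
VOCAB = sorted({ch for s in VALID_OUTPUTS for ch in s})    # character-level "tokens"

def allowed_next(prefix: str) -> set[str]:
    # Tokens that keep the output a prefix of some string in the grammar.
    return {s[len(prefix)] for s in VALID_OUTPUTS
            if s.startswith(prefix) and len(s) > len(prefix)}

def generate() -> str:
    out = ""
    while out not in VALID_OUTPUTS:
        scores = {tok: random.random() for tok in VOCAB}   # stand-in for model logits
        mask = allowed_next(out)                           # grammar mask for this step
        out += max(mask, key=lambda tok: scores[tok])      # greedy pick among allowed tokens
    return out

print(generate())   # always a well-formed object with a "name" field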