Easiest way to run a local LLM. Pair with Open WebUI for full ChatGPT replacement.
Run Llama/Mistral/Gemma/DeepSeek locallyOllama-compatible API for any clientBrew/winget/curl install
install: brew install ollama && ollama run llama3.3
caveat: Needs 16GB+ RAM for usable models. GPU strongly recommended.
The default ChatGPT-replacement self-host. Massive ecosystem.
ChatGPT-style web UI for any backend (Ollama, OpenAI-compatible)RAG with file uploadsFunction calling, web search plugin
install: docker run -p 3000:8080 ghcr.io/open-webui/open-webui:main
caveat: Best paired with Ollama running locally.
Permissive (MIT) open-weight model targeting ChatGPT/Claude-class chat and agentic coding without API lock-in - the strongest open option for a self-hosted stack.
Fully-open (MIT) flagship MoE from Z.ai / Zhipu AI (weights released Jun 2026)~744B total / ~40B active parameters, 1M-token context (~5x GLM-5.1)Strongest open-weight coding model at release - #2 on Code Arena, ~#3 on FrontierSWE behind only Fable 5 / Opus 4.8Tool use + reasoning, served via vLLM/SGLang
install: ollama run glm-5.2 (or download weights from HuggingFace: zai-org/GLM-5.2, serve via vLLM/SGLang)
caveat: MIT, fully open, no usage restrictions. At ~744B params it is multi-GPU / data-center scale - use a hosted provider if you lack the hardware. Note: the hosted Z.ai API routes through China (a data-residency consideration); self-hosting the open weights avoids that.
Drop-in open-weight replacement for the OpenAI/Anthropic chat APIs. Flash is the locally-runnable variant; serve it behind Open WebUI.
MIT-licensed open-weight MoE: V4-Flash (284B/13B active) + V4-Pro (1.6T/49B active)1M-token contextNear-frontier quality at a fraction of API costRun locally or self-host the backend
install: Download from HuggingFace (deepseek-ai/DeepSeek-V4-Flash); serve via vLLM/SGLang
caveat: Even Flash (284B) needs serious VRAM/multi-GPU; Pro (1.6T) is data-center scale. MIT license, weights fully open.
Unrestricted (Apache-2.0) open-weight model to replace the OpenAI/Anthropic/Gemini chat APIs in a self-hosted stack.
Apache-2.0 open-weight MoE from Mistral AI: 675B total / 41B active256K context, multimodal (text + vision)Production-grade general-purpose modelServe via vLLM/SGLang or run through Ollama
install: ollama run mistral-large-3 (or download weights from HuggingFace: mistralai/Mistral-Large-3-675B-Instruct-2512-BF16, serve via vLLM/SGLang)
caveat: Apache-2.0, no usage restrictions. December 2025 release. At 675B params it is multi-GPU / data-center scale - use a hosted provider if you lack the hardware.