Problem Understanding
Restate the problem in your own words.
Design an LLM Serving Platform (ChatGPT)
Design a ChatGPT-class LLM serving platform. Users send chat completion requests that carry multi-turn conversation context. The system routes each request to the right model variant, batches requests on GPUs for throughput, streams tokens back as they are generated (time to first token, TTFT, under 1 s), reuses the per-conversation KV-cache, and runs moderation on both input and output. GPUs are the scarce resource, so utilisation is the headline cost metric. The decisive trade-offs are static vs continuous batching, per-conversation KV-cache vs prompt-prefix-only caching, and how to route across model variants (small / large / specialised) to balance cost against quality.
- ChatGPT (OpenAI): Conversational LLM at hundreds of millions of users; full serving stack including moderation.
- Claude (Anthropic): Same shape; emphasis on constitutional AI safety layer.
- Gemini (Google): Multi-modal LLM serving stack on Google's TPU fleet.
- GitHub Copilot: Specialised LLM serving for code completion; latency-defining.
- Together AI / Replicate: Multi-model API platforms; same routing + batching primitives.
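To make the static vs continuous batching trade-off from the problem statement concrete, here is a minimal toy sketch. It is not how any of the systems listed above are implemented. It assumes every request needs a known number of decode steps, ignores prefill cost and GPU memory limits, and treats one "step" as one token for every in-flight request. The function names and the sample workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests finish early, but their slots idle until the batch ends.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue immediately."""
    queue = deque(lengths)
    slots = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or slots:
        # Refill any free slots before the next decode step.
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        steps += 1
        # Every in-flight request emits one token; finished requests leave.
        slots = [r - 1 for r in slots if r > 1]
    return steps


def utilisation(lengths, batch_size, steps):
    """Fraction of slot-steps that produced a token."""
    return sum(lengths) / (batch_size * steps)


# Hypothetical workload: alternating short and long generations (tokens each).
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
batch_size = 2

static_steps = static_batching_steps(lengths, batch_size)
continuous_steps = continuous_batching_steps(lengths, batch_size)

print("static:    ", static_steps, utilisation(lengths, batch_size, static_steps))
print("continuous:", continuous_steps, utilisation(lengths, batch_size, continuous_steps))
```

On this workload the static scheduler needs 32 steps and the continuous one needs 22, because under static batching each short request leaves its slot idle while the long request in the same batch finishes. That idle time is the utilisation gap the problem statement describes as the headline cost metric.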
Your task: read the problem above, then write what the system is, who uses it, the rough scale, and the headline UX expectation — in your own words. Submit for AI review when you're ready.