Some of the engineering problems we ran into:
- GPU cold starts and queue scheduling
- Multi-tenant isolation without wasting VRAM
- Model loading vs container loading tradeoffs
- Batch vs real-time inference routing
- Handling burst workloads without long-term GPU reservation
- Cost predictability vs autoscaling behavior
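To make the batch vs real-time routing point concrete, here is a minimal sketch of one common approach: split traffic by latency budget, sending tight-deadline requests to always-warm replicas and everything else to a cheaper batch pool. All names (`InferenceRequest`, `route`, the 500 ms cutoff) are illustrative assumptions, not details from the write-up.

```python
import time
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class InferenceRequest:
    prompt: str
    deadline_ms: int  # client's latency budget
    arrived: float = field(default_factory=time.monotonic)

realtime_q: Queue = Queue()  # drained by always-warm GPU replicas
batch_q: Queue = Queue()     # drained opportunistically by burst capacity

def route(req: InferenceRequest, realtime_cutoff_ms: int = 500) -> str:
    """Latency-sensitive requests go to warm replicas; the rest
    wait in the batch queue, avoiding long-term GPU reservation."""
    if req.deadline_ms <= realtime_cutoff_ms:
        realtime_q.put(req)
        return "realtime"
    batch_q.put(req)
    return "batch"
```

In practice the cutoff would be tuned against queue depth and cold-start cost rather than fixed.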
We wrote up the architecture decisions, what failed, and what worked.
Happy to answer technical questions - especially around GPU scheduling, inference optimization, and workload isolation.