Planning your deployment
Before running an LLM in production, you need to make a few key decisions. These early choices shape your infrastructure requirements, costs, and how well the model performs for your use case.
📄️ Serverless vs. self-hosted LLM inference
Understand the differences between serverless LLM APIs and self-hosted LLM deployments.
📄️ Choosing the right model
Select the right model for your use case.
📄️ Choosing the right GPU
Select the right NVIDIA or AMD GPUs (e.g., L4, A100, H100, B200, MI250X, MI300X, MI350X) for LLM inference.
📄️ Calculating GPU memory for serving LLMs
Learn how to calculate GPU memory for serving LLMs.
📄️ Choosing the right inference framework
Select the right inference framework for your use case.
📄️ Bring Your Own Cloud (BYOC)
Bring Your Own Cloud (BYOC) is a deployment model where vendors run software in your cloud, combining managed orchestration with complete data control.
📄️ On-prem LLM deployments
On-prem LLMs are large language models deployed within an organization’s own infrastructure, such as private data centers or air-gapped environments. This pattern offers full control over data, models, performance, and cost.
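As a preview of the GPU memory topic above, here is a minimal sketch of the kind of back-of-the-envelope estimate covered there. The function name, the 2 bytes/parameter figure (FP16), and the ~20% overhead factor for KV cache and activations are illustrative assumptions, not values from the guide itself:

```python
def estimate_gpu_memory_gb(params_billions: float,
                           bytes_per_param: int = 2,
                           overhead: float = 1.2) -> float:
    """Rough serving-memory estimate: model weights plus ~20% overhead
    for KV cache and activations (a common rule of thumb, not exact).

    params_billions: parameter count in billions (so the result is in GB).
    bytes_per_param: 2 for FP16/BF16, 1 for INT8, 0.5 for 4-bit quantization.
    """
    return params_billions * bytes_per_param * overhead

# Example: a hypothetical 70B-parameter model served in FP16
print(estimate_gpu_memory_gb(70))   # roughly 168 GB, i.e. multiple GPUs
```

The linked page walks through the full calculation, including how sequence length and batch size drive KV cache growth beyond this simple weights-based estimate.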