Planning your deployment
Before running an LLM in production, you need to make a few key decisions. These early choices shape your infrastructure requirements, costs, and how well the model performs for your use case.
📄️ Serverless vs. self-hosted LLM inference
Understand the differences between serverless LLM APIs and self-hosted LLM deployments.
📄️ Choosing the right model
Select the right model for your use case.
📄️ Choosing the right GPU
Select the right NVIDIA or AMD GPUs (e.g., L4, A100, H100, B200, MI250X, MI300X, MI350X) for LLM inference.
📄️ Calculating GPU memory for serving LLMs
Learn how to calculate GPU memory for serving LLMs.
📄️ Choosing the right inference framework
Select the right inference framework for your use case.
📄️ Bring Your Own Cloud (BYOC)
Bring Your Own Cloud (BYOC) is a deployment model where vendors run software in your cloud, combining managed orchestration with complete data control.
📄️ On-prem LLM deployments
On-prem LLMs are large language models deployed within an organization’s own infrastructure, such as private data centers or air-gapped environments. This pattern offers full control over data, models, performance, and cost.
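As a preview of the GPU memory topic above, here is a minimal sketch of the kind of back-of-the-envelope estimate covered there. The function name, the 2 bytes/parameter figure (FP16), and the ~20% overhead factor for KV cache and activations are illustrative assumptions, not values from the guide itself:

```python
def estimate_gpu_memory_gb(params_billions: float,
                           bytes_per_param: int = 2,
                           overhead: float = 1.2) -> float:
    """Rough serving-memory estimate: model weights plus ~20% overhead
    for KV cache and activations (a common rule of thumb, not exact).

    params_billions: parameter count in billions (so the result is in GB).
    bytes_per_param: 2 for FP16/BF16, 1 for INT8, 0.5 for 4-bit quantization.
    """
    return params_billions * bytes_per_param * overhead

# Example: a hypothetical 70B-parameter model served in FP16
print(estimate_gpu_memory_gb(70))   # roughly 168 GB, i.e. multiple GPUs
```

The linked page walks through the full calculation, including how sequence length and batch size drive KV cache growth beyond this simple weights-based estimate.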