LLM Inference & Medical Imaging | India Market | December 2024
This guide provides three configuration options: a budget-compliant build, a stretch-budget build with RTX 4080 Super, and a high-performance recommendation with RTX 4090 that exceeds budget but delivers the expandability and VRAM needed for serious LLM work.
The fundamental challenge is memory. At INT4/GPTQ quantization, a 120B-parameter model requires approximately 60-70GB for weights, plus roughly 20% overhead for KV cache and activations, totaling around 72-84GB during inference. The RTX 4090's 24GB is the largest VRAM pool on any consumer GPU, handling at most 30-40B quantized models comfortably, or 70B only with heavy offloading and severe speed degradation (see the estimate sketch after the table).
| Quantization Level | 120B Model Size | GPU Requirements |
|---|---|---|
| FP16 | ~240GB | 4× A100 80GB |
| INT8 | ~120GB | 2× A100 80GB |
| INT4/GPTQ | ~60-70GB | ≥80GB with KV cache; not achievable on consumer hardware |
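The arithmetic behind these figures is simple enough to sanity-check; a minimal sketch using the rule-of-thumb constants above (billions of parameters times bytes per weight, plus the ~20% KV-cache margin):

```python
def weights_gb(params_b: float, bits_per_weight: int) -> float:
    """Weight memory in GB: parameters (billions) x bytes per weight."""
    return params_b * bits_per_weight / 8

def total_gb(params_b: float, bits_per_weight: int, kv_overhead: float = 0.20) -> float:
    """Weights plus the rule-of-thumb ~20% margin for KV cache and activations."""
    return weights_gb(params_b, bits_per_weight) * (1 + kv_overhead)

# Reproduce the 120B rows of the table above
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4/GPTQ", 4)]:
    print(f"{name:9s} weights ~{weights_gb(120, bits):.0f} GB, "
          f"with KV cache ~{total_gb(120, bits):.0f} GB")
# FP16 ~240 GB, INT8 ~120 GB, INT4 ~60 GB (~72 GB with overhead)
```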
Medical imaging requirements are far more modest. MONAI, nnU-Net, and TotalSegmentator run effectively on 16-24GB VRAM—making the GPU choice primarily about LLM capability rather than radiology workloads.
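As a rough illustration of how light these workloads are, here is a minimal MONAI inference sketch; the UNet configuration, volume shape, and patch size are placeholders rather than a validated clinical pipeline:

```python
import torch
from monai.networks.nets import UNet
from monai.inferers import sliding_window_inference

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Generic 3D UNet; channel/stride choices are illustrative, not a tuned clinical model.
model = UNet(
    spatial_dims=3, in_channels=1, out_channels=2,
    channels=(16, 32, 64, 128, 256), strides=(2, 2, 2, 2),
).to(device).eval()

volume = torch.randn(1, 1, 256, 256, 128, device=device)  # stand-in CT volume

with torch.no_grad():
    # Patch-based inference keeps peak VRAM bounded regardless of volume size.
    seg = sliding_window_inference(volume, roi_size=(96, 96, 96),
                                   sw_batch_size=4, predictor=model)

if torch.cuda.is_available():
    print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```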
Option A (budget-compliant build): This configuration maximizes capability within the strict ₹200,000 constraint. The RTX 4070 Ti Super with 16GB VRAM handles models up to 13-20B parameters with quantization and covers all medical imaging POC requirements comfortably.
Option B (stretch-budget build): Exceeding budget by ₹27,893, this build delivers meaningful LLM performance improvements and handles models up to 20-25B parameters quantized. The 16GB VRAM remains limiting, but the faster GPU improves inference speed by roughly 25-30% (82 to 106 tok/s on LLaMA 8B Q4).
Option C (high-performance build): This configuration substantially exceeds budget but represents the minimum viable hardware for approaching larger LLM workloads. The RTX 4090's 24GB VRAM enables 30-40B quantized models with headroom, and dual-GPU expansion to 48GB combined becomes possible, which is sufficient for 70B quantized models.
The RTX 4090's 24GB VRAM and 1,008 GB/s memory bandwidth make it the only consumer card capable of running 30-40B models comfortably. The 4080 Super and 4070 Ti Super share 16GB VRAM—adequate for medical imaging but limiting for LLMs. Professional RTX A4000/A5000 cards offer no advantage at their price points; the A6000 (48GB) at ₹375,000+ is impractical for this budget.
| GPU | VRAM | Memory Bandwidth | LLaMA 8B Q4 | Max Model | Price (₹) |
|---|---|---|---|---|---|
| RTX 4070 Ti Super | 16GB | 672 GB/s | 82 tok/s | ~13-20B Q4 | 73,000 |
| RTX 4080 Super | 16GB | 736 GB/s | 106 tok/s | ~20-25B Q4 | 99,000 |
| RTX 4090 | 24GB | 1,008 GB/s | 128 tok/s | ~30-40B Q4 | 149,000 |
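The throughput column reflects community benchmarks; a minimal sketch for reproducing a comparable single-GPU measurement with llama-cpp-python (the model path is a placeholder for any LLaMA 8B Q4 GGUF file):

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

# Placeholder path; any Q4_K_M GGUF of an 8B model gives a comparable number.
llm = Llama(model_path="./llama-3-8b-instruct.Q4_K_M.gguf",
            n_gpu_layers=-1,  # keep the entire model in VRAM
            n_ctx=4096, verbose=False)

start = time.perf_counter()
out = llm("Summarize the role of quantization in LLM inference.", max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.0f} tok/s")
```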
AMD's Ryzen 7000-series (Zen 4) CPUs provide native AVX-512 support, accelerating CPU-based inference operations by 15-20% compared to Intel consumer chips, where AVX-512 is disabled. The AM5 platform guarantees CPU upgrade support through 2027+ (Zen 5, Zen 6), while Intel's LGA1700 is a dead-end socket.
When GPU VRAM is insufficient, models offload layers to system memory, and DDR5's higher bandwidth (roughly double typical DDR4 for bandwidth-bound AI workloads) directly accelerates this path. The AM5 platform's support for up to 192GB RAM on select motherboards enables future expansion.
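In practice this hybrid path is exposed as layer offloading; a sketch of llama.cpp-style partial offload via llama-cpp-python, where n_gpu_layers bounds how much of the model stays in VRAM (the path and layer count are illustrative):

```python
from llama_cpp import Llama

# Offload only part of the model: e.g. ~30 of a 70B model's 80 transformer layers
# fit in 24GB VRAM; the remainder runs from system RAM, so DDR5 bandwidth
# becomes the throughput bottleneck.
llm = Llama(
    model_path="./llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=30,  # layers kept on the GPU; -1 would mean "all"
    n_ctx=4096,
)
print(llm("Test prompt", max_tokens=32)["choices"][0]["text"])
```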
Running 120B models eventually requires multi-GPU configurations. RTX 40-series cards lack NVLink, but tensor parallelism over PCIe works effectively with frameworks like vLLM and ExLlamaV2 (see the sketch after the table below). Two RTX 4090s (48GB combined) can run 70B quantized models at ~19 tokens/sec, still short of 120B requirements but a practical ceiling for consumer hardware.
| Multi-GPU Config | Combined VRAM | 70B Q4 Performance | Approx. Cost |
|---|---|---|---|
| 2× RTX 4090 | 48GB | 19 tok/s | ₹298,000 (GPUs only) |
| 2× RTX 4080 Super | 32GB | OOM for 70B | ₹198,000 (GPUs only) |
| Single RTX A6000 | 48GB | 14.6 tok/s | ₹375,000+ |
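A minimal sketch of PCIe tensor parallelism with vLLM on such a dual-GPU box; the checkpoint name is illustrative, and any locally available AWQ/GPTQ 70B model would substitute:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 splits each layer's weight matrices across both GPUs,
# synchronizing activations over PCIe in place of NVLink.
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # illustrative quantized checkpoint
    quantization="awq",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```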
For oncology and radiology POC workloads, even the budget configuration provides substantial headroom. The major frameworks have modest requirements compared to large LLMs.
All three configurations handle these medical imaging workloads without limitation. The GPU choice should therefore be driven primarily by LLM requirements and budget rather than radiology needs.
The RTX 4090 draws 450W TDP with transient spikes to 600W+. Combined with a high-end CPU (170W+ under load) and system overhead, total power draw reaches 800-1000W during inference operations. ATX 3.0 power supplies with native 12VHPWR connectors are specified to tolerate transient excursions up to 3× nominal GPU load, which is critical for stable operation; a quick headroom check follows the table below.
| Configuration | Recommended PSU | Wattage |
|---|---|---|
| RTX 4070 Ti Super + Ryzen 7 | Corsair RM850x | 850W |
| RTX 4080 Super + Ryzen 9 | Corsair RM1000x | 1000W |
| RTX 4090 + Ryzen 9 | Corsair HX1200 | 1200W |
| Future dual GPU | Corsair HX1500i | 1500W+ |
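A quick headroom check behind these pairings, using the nominal figures cited above; the smaller-build TDPs and the overhead constant are assumptions, not measurements:

```python
def psu_check(label: str, gpu_w: int, cpu_w: int, psu_w: int,
              overhead_w: int = 100, transient_mult: float = 3.0) -> None:
    """Compare sustained and worst-case transient draw against PSU capacity."""
    sustained = gpu_w + cpu_w + overhead_w
    transient = gpu_w * transient_mult + cpu_w + overhead_w
    verdict = "OK" if sustained <= 0.8 * psu_w else "undersized"
    print(f"{label}: sustained ~{sustained}W, transient ~{transient:.0f}W "
          f"on a {psu_w}W PSU ({verdict} at the 80% sustained-load rule)")

# 450W GPU / 170W CPU figures come from the text; the others are assumptions.
psu_check("RTX 4090 + Ryzen 9", gpu_w=450, cpu_w=170, psu_w=1200)
psu_check("RTX 4070 Ti Super + Ryzen 7", gpu_w=285, cpu_w=105, psu_w=850)
```

The transient figure can legitimately exceed the sustained rating: ATX 3.0 units are designed to ride through such microsecond-scale excursions without tripping overcurrent protection.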
Significant price variation exists between Indian retailers. EliteHubs consistently offers the best GPU and motherboard prices: RTX 4090 at ₹148,945 versus ₹255,000+ on Amazon India. For specific components, compare retailer listings before ordering.
For strict ₹200,000 adherence: Option A delivers a capable POC workstation for medical imaging and models up to 20B parameters. It cannot run 120B models; no consumer hardware within this budget can.
For serious LLM development: Stretch to Option C with the RTX 4090. The ₹80,000+ budget increase purchases 8GB additional VRAM, 50% higher memory bandwidth, and 55% faster inference—differences that fundamentally change what models are practical to run. The expandability to dual GPUs creates a viable path toward 70B models.
For 120B models specifically: The honest answer is that consumer hardware is insufficient. Options include: (1) using quantized 70B models as a proxy during POC development, (2) hybrid inference with partial CPU offloading accepting 1-2 tokens/sec speeds, (3) cloud API access for 120B+ inference, or (4) substantially larger budget for used datacenter GPUs (2× A100 80GB at ₹800,000+).
The recommended path is Option C with the RTX 4090, acknowledging budget overrun, combined with cloud API usage for 120B model validation during POC. This balances local development capability, expandability, and practical access to larger models when required.
Configuration pricing current as of December 2024. Indian market prices fluctuate; verify current rates before purchase. Performance benchmarks derived from community testing on similar hardware configurations.