AI Workstation Build Guide

LLM Inference & Medical Imaging | India Market | December 2024

CRITICAL CONSTRAINT: Running 120B parameter models is not feasible within INR 200,000. A 120B dense model requires 60-120GB+ VRAM even with aggressive 4-bit quantization—far exceeding any consumer GPU's capacity. However, a capable POC workstation for 30-40B quantized models and comprehensive medical imaging workloads is achievable within budget, with a clear upgrade path toward larger models.

This guide provides three configuration options: a budget-compliant build, a stretch-budget build with RTX 4080 Super, and a high-performance recommendation with RTX 4090 that exceeds budget but delivers the expandability and VRAM needed for serious LLM work.

The 120B Model Problem: VRAM Math Doesn't Lie

The fundamental challenge is memory. At INT4/GPTQ quantization, a 120B parameter model requires approximately 60-70GB of VRAM for weights, plus roughly 20% overhead for KV cache and activations, totaling around 72-84GB during inference. The RTX 4090's 24GB is the largest VRAM available on any consumer GPU; it handles 30-40B quantized models comfortably, and 70B only with heavy offloading and a large speed penalty.

Quantization Level    120B Model Size    GPU Requirements
FP16                  ~240GB             4× A100 80GB
INT8                  ~120GB             2× A100 80GB
INT4/GPTQ             ~60-70GB           Not achievable on consumer hardware
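
To make the arithmetic above reproducible, the short Python sketch below estimates inference VRAM from parameter count and bits per weight; the 20% KV-cache allowance is this guide's working assumption, not a fixed constant, and real quantized files carry some extra per-layer overhead.

    def inference_vram_gb(params_billion, bits_per_weight, kv_overhead=0.20):
        # Weight memory in GB: parameters (billions) * bytes per weight
        weights_gb = params_billion * bits_per_weight / 8
        # Add a fractional allowance for KV cache and activations (guide's 20% assumption)
        return weights_gb, weights_gb * (1 + kv_overhead)

    for bits, label in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
        weights, total = inference_vram_gb(120, bits)
        print(f"120B {label}: ~{weights:.0f} GB weights, ~{total:.0f} GB with KV-cache overhead")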

Medical imaging requirements are far more modest. MONAI, nnU-Net, and TotalSegmentator run effectively on 16-24GB VRAM—making the GPU choice primarily about LLM capability rather than radiology workloads.

Configuration Option A: Budget-Compliant Build at ₹187,029


This configuration maximizes capability within the strict INR 200,000 constraint. The RTX 4070 Ti Super with 16GB VRAM handles models up to 13-20B parameters with quantization and covers all medical imaging POC requirements comfortably.

GPU           MSI RTX 4070 Ti Super Ventus 3X OC 16GB    ₹73,000
CPU           AMD Ryzen 7 7700X (8C/16T, AVX-512)        ₹27,985
Motherboard   MSI X670E Gaming Plus WiFi                 ₹19,999
RAM           G.Skill Ripjaws S5 64GB DDR5-6000 CL32     ₹19,045
Storage       Samsung 990 Pro 2TB NVMe                   ₹18,000
PSU           Corsair RM1000x ATX 3.0 (1000W)            ₹17,000
Case          Corsair 4000D Airflow                      ₹12,000
TOTAL                                                    ₹187,029
Performance Expectations: LLaMA 3 8B Q4 at ~82 tokens/sec | Mistral 7B FP16 inference | Medical CT/MRI segmentation with TotalSegmentator at ~30-60 seconds per scan | Maximum practical model size: approximately 20B quantized

Configuration Option B: Stretch Build with RTX 4080 Super at ₹227,893


Exceeding budget by ₹27,893, this build delivers meaningful LLM performance improvements and handles quantized models in the 20-25B parameter range. The 16GB VRAM remains the limiting factor, but faster Tensor cores and higher memory bandwidth lift inference throughput by roughly 30% (82 to 106 tokens/sec on LLaMA 3 8B Q4).

GPU           ZOTAC RTX 4080 Super Trinity OC 16GB       ₹98,999
CPU           AMD Ryzen 9 7900X (12C/24T, 5.6GHz)        ₹34,850
Motherboard   MSI X670E Gaming Plus WiFi                 ₹19,999
RAM           G.Skill Ripjaws S5 64GB DDR5-6000 CL32     ₹19,045
Storage       Samsung 990 Pro 2TB NVMe                   ₹18,000
PSU           Corsair RM1000x ATX 3.0 (1000W)            ₹17,000
Case          Corsair 4000D Airflow                      ₹12,000
Cooling       240mm AIO (Corsair H100i or similar)       ₹8,000
TOTAL                                                    ₹227,893
Key Improvements: The Ryzen 9 7900X's 12 cores significantly accelerate data preprocessing and tokenization, while the larger L3 cache improves CPU offloading performance when model weights exceed GPU VRAM.

Configuration Option C: High-Performance Build with RTX 4090 at ₹293,425


This configuration substantially exceeds budget but represents the minimum viable hardware for approaching larger LLM workloads. The RTX 4090's 24GB VRAM enables 30-40B quantized models with headroom, and dual-GPU expansion to 48GB combined becomes possible—sufficient for 70B quantized models.

GPU           INNO3D RTX 4090 Gaming X3 24GB             ₹148,945
CPU           AMD Ryzen 9 7900X (12C/24T)                ₹34,850
Motherboard   ASUS TUF Gaming X670E-Plus WiFi            ₹32,585
RAM           64GB DDR5-6000 (2×32GB)                    ₹19,045
Storage       Samsung 990 Pro 2TB NVMe                   ₹18,000
PSU           Corsair HX1200 Platinum (1200W)            ₹25,000
Case          Lian Li Lancool III (full tower)           ₹15,000
TOTAL                                                    ₹293,425
Performance Benchmarks: LLaMA 3 8B Q4 at 127.7 tokens/sec | 70B Q4 runs at 9-12 tokens/sec with partial CPU offload | TotalSegmentator completes CT segmentation in 30 seconds | A second RTX 4090 can be added later for 48GB combined VRAM, enabling 70B models at ~19 tokens/sec via tensor parallelism

Component Selection Rationale

GPU Analysis: Stark Trade-offs

The RTX 4090's 24GB VRAM and 1,008 GB/s memory bandwidth make it the only consumer card capable of running 30-40B models comfortably. The 4080 Super and 4070 Ti Super share 16GB VRAM—adequate for medical imaging but limiting for LLMs. Professional RTX A4000/A5000 cards offer no advantage at their price points; the A6000 (48GB) at ₹375,000+ is impractical for this budget.

GPU                  VRAM    Memory Bandwidth   LLaMA 8B Q4   Max Model    Price (₹)
RTX 4070 Ti Super    16GB    672 GB/s           82 tok/s      ~13-20B Q4   73,000
RTX 4080 Super       16GB    736 GB/s           106 tok/s     ~20-25B Q4   99,000
RTX 4090             24GB    1,008 GB/s         128 tok/s     ~30-40B Q4   149,000

AMD AM5 Platform: Best Foundation for AI

Native AVX-512 support accelerates CPU-based inference operations by 15-20% compared to Intel consumer chips where AVX-512 is disabled. The platform guarantees CPU upgrade support through 2027+ (Zen 5, Zen 6), while Intel's LGA1700 is a dead-end socket.
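
To confirm that AVX-512 is actually exposed before relying on it for CPU-side inference, a quick check of the CPU flags the Linux kernel reports is enough; this is a convenience sketch, not a required build step.

    # Quick Linux check for AVX-512 support (reads the kernel's reported CPU flags).
    def has_avx512(cpuinfo_path="/proc/cpuinfo"):
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    flags = line.split(":", 1)[1].split()
                    return any(flag.startswith("avx512") for flag in flags)
        return False

    if __name__ == "__main__":
        print("AVX-512 available:", has_avx512())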

64GB DDR5: Minimum Viable RAM

When GPU VRAM is insufficient, models offload layers to system memory—DDR5's higher bandwidth (up to 200% improvement over DDR4 for certain AI workloads) directly accelerates this process. The Ryzen platform's support for 192GB RAM on select motherboards enables future expansion.
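
When a model's weights exceed GPU VRAM, llama.cpp-style runtimes let you keep only part of the network on the GPU and run the remaining layers from system RAM. A minimal llama-cpp-python sketch is below; the GGUF path is a placeholder, and the right n_gpu_layers value depends on the specific quantized file and the VRAM actually free on your card.

    from llama_cpp import Llama

    # Placeholder GGUF path; n_gpu_layers controls how many transformer layers
    # live in VRAM, while the rest are computed on the CPU from system RAM.
    llm = Llama(
        model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=32,   # reduce if the model overflows 16GB/24GB VRAM
        n_ctx=8192,        # context window; the KV cache grows with this value
    )

    out = llm("Summarize the findings of this CT report:", max_tokens=128)
    print(out["choices"][0]["text"])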

Multi-GPU Expandability Considerations

Running 120B models eventually requires multi-GPU configurations. RTX 40-series cards lack NVLink, but tensor parallelism via PCIe works effectively with frameworks like vLLM and ExLlamaV2. Two RTX 4090s (48GB combined) can run 70B quantized models at ~19 tokens/sec—still short of 120B requirements but a practical ceiling for consumer hardware.

Multi-GPU Config      Combined VRAM   70B Q4 Performance   Approx. Cost
2× RTX 4090           48GB            19 tok/s             ₹298,000 (GPUs only)
2× RTX 4080 Super     32GB            OOM for 70B          ₹198,000 (GPUs only)
Single RTX A6000      48GB            14.6 tok/s           ₹375,000+
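
As a sketch of how PCIe tensor parallelism is expressed in practice, vLLM splits a model across GPUs with a single argument. The checkpoint name below is a placeholder, and a 4-bit 70B model is assumed to be one that actually fits in 48GB combined.

    from vllm import LLM, SamplingParams

    # Split the model across two GPUs via tensor parallelism (no NVLink required;
    # activations move over PCIe). Model name and quantization are illustrative.
    llm = LLM(
        model="some-org/llama-3-70b-instruct-awq",  # placeholder 4-bit checkpoint
        tensor_parallel_size=2,                     # e.g. 2× RTX 4090 = 48GB combined
        gpu_memory_utilization=0.90,
    )

    params = SamplingParams(max_tokens=256, temperature=0.2)
    outputs = llm.generate(["Draft an impression for a chest CT with ..."], params)
    print(outputs[0].outputs[0].text)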

Medical Imaging Workload Analysis

For oncology and radiology POC workloads, even the budget configuration provides substantial headroom. The major frameworks, MONAI, nnU-Net, and TotalSegmentator, have modest requirements compared to large LLMs and run comfortably within the 16-24GB of VRAM offered by every card in this guide.

All three configurations handle these medical imaging workloads without limitation. The GPU choice should therefore be driven primarily by LLM requirements and budget rather than radiology needs.
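
For reference, a TotalSegmentator run on a CT volume is only a few lines of Python; the paths below are placeholders and the exact options vary between TotalSegmentator releases.

    from totalsegmentator.python_api import totalsegmentator

    # Segment a CT volume; paths are placeholders. On 16-24GB GPUs the default
    # model runs comfortably; fast=True trades accuracy for speed on smaller cards.
    totalsegmentator(
        input="data/chest_ct.nii.gz",    # input CT volume (NIfTI)
        output="output/chest_ct_seg",    # directory for per-structure masks
        fast=False,
    )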

PSU and Thermal Requirements

The RTX 4090 draws 450W TDP with transient spikes to 600W+. Combined with a high-end CPU (170W+ under load) and system overhead, total power draw reaches 800-1000W during inference operations. ATX 3.0 power supplies with native 12VHPWR connectors handle transient spikes up to 3× nominal load—critical for stable operation.

Configuration                  Recommended PSU     Wattage
RTX 4070 Ti Super + Ryzen 7    Corsair RM850x      850W
RTX 4080 Super + Ryzen 9       Corsair RM1000x     1000W
RTX 4090 + Ryzen 9             Corsair HX1200      1200W
Future dual GPU                Corsair HX1500i     1500W+
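
The wattage recommendations above follow from simple headroom math: add component TDPs, allow for transient spikes, and keep sustained load well below PSU capacity. A small Python sketch of that arithmetic, with the headroom and transient factors chosen for illustration rather than taken from any PSU specification:

    def recommended_psu_watts(gpu_tdp, cpu_tdp, system_overhead=100,
                              transient_factor=1.5, target_load=0.7):
        # Sustained draw: GPU + CPU + allowance for drives, fans, motherboard.
        sustained = gpu_tdp + cpu_tdp + system_overhead
        # Keep steady-state draw around target_load of PSU capacity...
        by_load = sustained / target_load
        # ...and make sure GPU transient spikes still fit within the rating.
        by_spike = gpu_tdp * transient_factor + cpu_tdp + system_overhead
        return max(by_load, by_spike)

    # Illustrative figures from this guide: RTX 4090 (450W TDP) + Ryzen 9 (~170W)
    print(f"RTX 4090 build: ~{recommended_psu_watts(450, 170):.0f}W minimum")
    # Prints ~1029W; the guide recommends 1200W for extra margin and spike handling.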

Recommended Purchasing Channels in India

Significant price variation exists between Indian retailers. EliteHubs consistently offers the best GPU and motherboard prices—RTX 4090 at ₹148,945 versus ₹255,000+ on Amazon India. For specific components:

Best Retailers:
  • GPUs: EliteHubs, Computech Store
  • CPUs: Computech Store, Shivam IT
  • Motherboards/RAM: EliteHubs, MD Computers
  • Storage/PSU: Vedant Computers, PC Studio, PrimeABGB
AVOID: Amazon and Flipkart for GPUs—prices often run 30-80% higher than specialty retailers.

Practical Path Forward

For strict INR 200,000 adherence: Option A delivers a capable POC workstation for medical imaging and models up to 20B parameters. It cannot run 120B models—no consumer hardware within this budget can.

For serious LLM development: Stretch to Option C with the RTX 4090. The roughly ₹106,000 increase over Option A purchases 8GB of additional VRAM, 50% higher memory bandwidth, and roughly 55% faster inference, differences that fundamentally change which models are practical to run. The expandability to dual GPUs creates a viable path toward 70B models.

For 120B models specifically: The honest answer is that consumer hardware is insufficient. Options include: (1) using quantized 70B models as a proxy during POC development, (2) hybrid inference with partial CPU offloading accepting 1-2 tokens/sec speeds, (3) cloud API access for 120B+ inference, or (4) substantially larger budget for used datacenter GPUs (2× A100 80GB at ₹800,000+).

The recommended path is Option C with the RTX 4090, acknowledging budget overrun, combined with cloud API usage for 120B model validation during POC. This balances local development capability, expandability, and practical access to larger models when required.

Configuration pricing current as of December 2024. Indian market prices fluctuate; verify current rates before purchase. Performance benchmarks derived from community testing on similar hardware configurations.