Strategic Architectural Specification for a High-Performance Biomedical AI Workstation: Optimizing for 120B-Parameter LLMs and Volumetric Imaging within a Constrained Budgetary Framework

1. Executive Summary

This comprehensive research report presents a rigorous architectural specification and procurement strategy for a dedicated high-performance workstation engineered to bridge the computational gap between Proof-of-Concept (POC) execution of 120-billion parameter Large Language Models (LLMs) and high-fidelity medical image processing. The primary objective is to deliver a viable system configuration within a strictly defined budget of INR 200,000, while adhering to the critical requirement of a robust upgrade path for future scalability.

The analysis reveals that the intersection of two distinct workloads—LLM inference and medical image segmentation—creates a complex optimization problem. Large Language Models, particularly those in the 120B-parameter class such as gpt-oss-120b, are memory-bound workloads that demand massive capacity and high throughput to store and access model weights. Conversely, medical image processing frameworks such as MONAI (Medical Open Network for AI) and 3D Slicer place heavy demands on GPU compute capability (CUDA cores) and memory bandwidth when handling high-resolution volumetric data (CT/MRI). Standard consumer-grade hardware configurations typically fail to satisfy this dual mandate due to insufficient Video Random Access Memory (VRAM) and restricted PCIe lane configurations.

The proposed solution necessitates a departure from conventional "off-the-shelf" procurement strategies. By synthesizing data on hardware physics, software optimization techniques (quantization, hybrid inference), and the specific dynamics of the Indian computer hardware market, this report advocates a Hybrid Compute Architecture. This architecture leverages the cost-efficiency of the secondary market for high-end consumer GPUs—specifically the NVIDIA GeForce RTX 3090—integrated with a modern AMD AM5 platform that supports PCIe bifurcation. The combination enables the immediate execution of quantized 120B models via CPU-GPU offloading, provides the bandwidth required for medical imaging, and reserves architectural headroom for a future dual-GPU configuration.

The recommended configuration centers on a used NVIDIA RTX 3090 24GB GPU paired with an AMD Ryzen 9 7900 processor and a substantial 96GB of DDR5 system memory. This specific alignment of components addresses the critical bottleneck of VRAM capacity, mitigates the latency of CPU-based inference through AVX-512 acceleration, and fits within the INR 200,000 ceiling. The report further details a risk-managed procurement strategy involving trusted Indian vendors such as Suraj Technology and community marketplaces like Zoukart and Techenclave, ensuring that the theoretical performance benefits are realizable in a practical, economically viable build.

2. Computational Theory and Workload Characterization

To engineer a system capable of handling "POC grade" 120B LLMs and professional medical imaging, it is imperative to first deconstruct the fundamental computational physics governing these workloads. Understanding the bottlenecks at the silicon level allows for precise component selection that avoids the common pitfalls of balanced consumer PC builds, which are often optimized for gaming rather than high-performance computing (HPC).

2.1. The Physics of Large Language Model Inference

The execution of a 120-billion parameter model on a workstation represents a significant challenge in memory management. An LLM operates by sequentially predicting the next token based on the preceding context. This process involves matrix-vector multiplication where the model's weights (parameters) must be loaded from memory into the compute units.

Memory Capacity as the Primary Constraint
In its native Half Precision (FP16) format, a 120B model requires 2 bytes per parameter. Consequently, the model alone necessitates approximately 240 GB of VRAM to load the weights.[1] This requirement places the workload firmly in the domain of enterprise data center hardware, such as clusters of NVIDIA A100 (80GB) or H100 GPUs, which are economically inaccessible for this project. To run such a model on a workstation with a budget of INR 200,000, one must utilize Quantization.

Quantization reduces the precision of the model weights from 16-bit floating-point numbers to lower-bit integers, drastically reducing the memory footprint with only a minor increase in perplexity (i.e., minimal loss of reasoning quality). At roughly 4-5 bits per weight (e.g., the GGUF Q4_K_M format), the 120B model's footprint drops to approximately 70-75 GB.
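
To make this arithmetic concrete, the following minimal sketch (plain Python; the bytes-per-weight values are approximations, and real GGUF files carry additional metadata and per-block scaling overhead) tabulates the footprint of a 120B-parameter model at several precisions.

```python
# Back-of-envelope memory footprint for a 120B-parameter model at several
# precisions. The bytes-per-weight figures are approximations; real model
# files add metadata and per-block scaling overhead.
PARAMS = 120e9

precisions = {
    "FP16 (native)":     2.0,   # bytes per weight
    "INT8 / Q8_0":       1.0,
    "Q4_K_M (~4.5 bit)": 0.56,  # approximate effective bytes per weight
}

for name, bytes_per_weight in precisions.items():
    gb = PARAMS * bytes_per_weight / 1e9
    print(f"{name:20s} ~{gb:6.0f} GB")

# FP16 (native)        ~  240 GB  -> data-center territory
# INT8 / Q8_0          ~  120 GB  -> still exceeds a workstation's RAM+VRAM budget
# Q4_K_M (~4.5 bit)    ~   67 GB  -> fits within 96 GB RAM + 24 GB VRAM
```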

The Hybrid Inference Architecture
Even with quantization, the resulting ~70-75 GB model far exceeds the 24GB VRAM capacity of the largest consumer GPUs, the RTX 3090 and RTX 4090. This necessitates a Split-Memory Strategy, also known as CPU Offloading. In this architecture, the model is partitioned:

  1. GPU Layers: As many layers as possible (typically 30-35 layers for a 120B model) are loaded into the GPU's ultra-fast VRAM (GDDR6X).
  2. CPU Layers: The remaining layers (approximately 80+) reside in the system RAM (DDR5).

During inference, computation flows sequentially through the layers. When the GPU finishes processing its layers, activations are transferred over the PCIe bus to the CPU, which processes the remaining layers from system RAM. This transition introduces a significant bandwidth bottleneck: the GPU's memory operates at 936 GB/s (RTX 3090), whereas dual-channel DDR5 system memory delivers approximately 50-70 GB/s. The GPU-resident portion of each inference step is therefore comparatively fast, while the CPU-resident portion is bounded by system memory bandwidth, resulting in generation speeds of roughly 2-5 tokens per second (t/s).[5, 6] While this is too slow for a real-time chatbot experience, it is perfectly acceptable for the "POC grade" research, automated analysis, and batch processing tasks specified in the user request.
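
The generation-speed figures above follow from a simple bandwidth argument, sketched below. All values are illustrative assumptions; the active_fraction knob is a hypothetical parameter representing how much of the CPU-resident data is effectively streamed per token, since measured speeds (the cited 2-5 t/s) typically land above the worst-case dense estimate.

```python
# Rough decode-speed model for hybrid CPU/GPU inference. In the
# bandwidth-bound regime, each generated token must stream the CPU-resident
# weights it touches from system RAM, so approximately:
#   tokens/s ~= RAM_bandwidth / CPU-resident_bytes_read_per_token
# All figures below are illustrative assumptions, not measurements.

model_gb       = 72    # ~Q4_K_M footprint of the 120B model
vram_usable_gb = 22    # RTX 3090 VRAM left after driver/context overhead
ram_bw_gbs     = 60    # typical dual-channel DDR5 effective bandwidth

cpu_resident_gb = model_gb - vram_usable_gb   # ~50 GB held in system RAM

# Fraction of the CPU-resident weights effectively read per token
# (1.0 = every byte streamed each token, the most pessimistic case).
for active_fraction in (1.0, 0.5, 0.25):
    tps = ram_bw_gbs / (cpu_resident_gb * active_fraction)
    print(f"active fraction {active_fraction:4.2f} -> ~{tps:.1f} tokens/s")
```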

2.2. Medical Image Computing: Throughput and Bandwidth

Unlike the sequential nature of LLMs, medical image processing is inherently parallel. Workflows involving frameworks like MONAI (Medical Open Network for AI) and 3D Slicer generally involve operations on 3D volumetric datasets (Voxels) derived from CT or MRI scans.

The Bandwidth Imperative
Medical imaging tasks, such as volumetric segmentation (e.g., separating a tumor from healthy tissue using VISTA-3D or UNet models), involve processing massive 3D matrices (tensors). A standard high-resolution CT scan might be a 512x512x512 voxel array. Processing these volumes requires high memory bandwidth to feed the CUDA cores efficiently.
Research indicates a critical divergence in modern GPU architecture that affects this workload. The newer NVIDIA RTX 4060 Ti 16GB, despite having a substantial VRAM buffer, utilizes a narrow 128-bit memory bus, resulting in a memory bandwidth of only 288 GB/s.[7] In stark contrast, the older RTX 3090 utilizes a 384-bit memory bus, delivering 936 GB/s of bandwidth.[8]

Architectural Consequence:
For deep learning training and heavy 3D rendering in MONAI, the bandwidth limitation of the 4060 Ti becomes a severe choke point. Benchmarks suggest that the RTX 3090 can be up to 2x faster in training loops and inference for large medical models compared to the 4060 Ti, purely due to the ability to move data in and out of the compute units more rapidly.[8] Furthermore, complex 3D visualizations in 3D Slicer rely on volume rendering techniques that scale linearly with memory bandwidth.[9] Therefore, for medical imaging, raw bandwidth is as critical as capacity.
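
A back-of-envelope calculation shows how directly bandwidth translates into per-pass latency for a 512^3 volume. The memory-traffic multiplier below is an assumed, purely illustrative figure standing in for the repeated reads and writes of feature maps in a UNet-style network.

```python
# Illustrative bandwidth arithmetic for volumetric segmentation.
# A single-channel 512^3 float32 volume, plus the intermediate feature maps
# a UNet-style network produces, must be moved repeatedly between GPU memory
# and the compute units.

voxels     = 512 ** 3                 # ~134 million voxels
volume_gb  = voxels * 4 / 1e9         # float32 -> ~0.54 GB per volume
traffic_gb = volume_gb * 100          # ASSUMED memory traffic per forward pass
                                      # (feature maps, weights, reads and writes)

for card, bw_gbs in [("RTX 4060 Ti 16GB", 288), ("RTX 3090 24GB", 936)]:
    seconds = traffic_gb / bw_gbs
    print(f"{card:18s}: >= {seconds:5.2f} s of pure memory traffic per pass")
```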

Software Stack Dependencies
The medical AI ecosystem is predominantly built on NVIDIA CUDA. Libraries like MONAI, PyTorch, and ITK/VTK (the backbone of 3D Slicer) are heavily optimized for CUDA acceleration.[10, 11] While Apple's Metal Performance Shaders (MPS) have made strides, they still lack support for specific 3D operators required in advanced medical research (e.g., certain 3D convolutions or deformable registration algorithms), often forcing a fallback to the CPU, which drastically slows down the workflow.[12, 13] This reinforces the necessity of an NVIDIA-based architecture for this specific use case.
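
The practical consequence is visible in the standard device-selection pattern used in PyTorch and MONAI scripts; the snippet below is a generic sketch, not code from any cited source.

```python
import torch

# Generic device-selection pattern for MONAI / PyTorch workflows.
# CUDA is preferred; on Apple silicon the MPS backend is used only if
# available, and operators it does not support fall back to the CPU path
# (the slowdown described above).
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Running on: {device}")
```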

3. Architectural Platform Analysis and Selection

To satisfy the constraints of budget (INR 200,000) and capability (120B LLM + Medical AI), three primary hardware platforms were evaluated.

3.1. Option A: Apple Mac Mini (M4 Pro) - The Unified Memory Contender

The Mac Mini with the M4 Pro chip utilizes a Unified Memory Architecture (UMA), where the CPU and GPU share a single pool of high-speed memory.

3.2. Option B: Intel LGA1700 (13th/14th Gen) - The Dead End

3.3. Option C: AMD AM5 (Ryzen 7000/9000) - The Strategic Choice

The AMD AM5 platform emerges as the optimal foundation for this workstation.

Conclusion: The AMD AM5 platform is selected as the architectural basis for this workstation.

4. Component Selection and Procurement Strategy

This section details the specific components selected to meet the technical requirements within the INR 200,000 budget, leveraging the nuances of the Indian hardware market.

4.1. The GPU: Used NVIDIA GeForce RTX 3090 24GB

The GPU is the most critical component. A new RTX 4090 (24GB) costs ~INR 1,80,000, consuming nearly the entire budget on its own. The RTX 4060 Ti (16GB) lacks the bandwidth for medical imaging and the VRAM for effective LLM offloading. Therefore, the used market is the only viable route.

4.2. The CPU: AMD Ryzen 9 7900 (Non-X)

4.3. The Motherboard: The Bifurcation Enabler

This component dictates the upgrade path. Most budget B650 motherboards feature one PCIe x16 slot wired to the CPU, while the second "x16" slot is electrically x4 and wired to the chipset. This is insufficient for a dual-GPU setup, as the second card would be severely bottlenecked.

4.4. System Memory (RAM): 96GB DDR5 (2x48GB)

Standard memory configurations (32GB/64GB) are arithmetically insufficient: the ~72 GB quantized model exceeds a 64 GB pool on its own, and even with ~22 GB of layers offloaded to VRAM, the remaining ~50 GB of CPU-resident weights leave little room for the operating system, inference runtime buffers, and loaded imaging datasets. A 96GB (2x48GB) kit provides the necessary headroom while retaining a two-DIMM, dual-channel configuration.

4.5. Storage: High-Throughput NVMe

Loading a 75GB model file into memory takes time. Slow storage results in frustratingly long startup latencies for every inference session.
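
The arithmetic below, assuming an approximately 72 GB model file and nominal sequential read speeds, shows why a Gen4 NVMe drive is specified.

```python
# Approximate cold-load times for a ~72 GB GGUF model file at various
# sequential read speeds (illustrative figures; real-world loads also
# depend on filesystem caching and memory mapping).
model_gb = 72

drives = {
    "SATA SSD (~0.55 GB/s)":     0.55,
    "PCIe 3.0 NVMe (~3.5 GB/s)": 3.5,
    "PCIe 4.0 NVMe (~7.0 GB/s)": 7.0,   # Kingston KC3000 class
}

for name, gbs in drives.items():
    print(f"{name:26s}: ~{model_gb / gbs:4.0f} s to read the model")
```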

4.6. Power Supply Unit (PSU): 1000W Gold

The RTX 3090 is notorious for "transient spikes"—microsecond bursts where power draw can exceed 500W. A substandard PSU will trigger over-current protection (OCP) and crash the system.

4.7. Chassis and Cooling

5. Comprehensive Bill of Materials (BOM)

The following table summarizes the optimized component list, pricing estimates based on current Indian market data, and sourcing channels.

Component Category   | Specific Selection            | Estimated Price (INR) | Sourcing Channel & Rationale
GPU                  | Used NVIDIA RTX 3090 24GB     | ₹55,000               | Zoukart / Suraj Technology / LebyoPC. Essential for 24GB VRAM and 936 GB/s bandwidth.
CPU                  | AMD Ryzen 9 7900              | ₹34,000               | MDComputers / Amazon. 12-core efficiency, AVX-512 for LLM acceleration.
Motherboard          | ASUS ProArt B650-Creator      | ₹25,000               | Micro Center India / Vedant. Validated x8/x8 bifurcation for dual-GPU upgrade path.
Memory               | 96GB (2x48GB) DDR5 5600MHz    | ₹34,000               | PrimeABGB / Computech. High-density non-binary RAM for model offloading stability.
Storage              | Kingston KC3000 2TB Gen4      | ₹13,500               | OnlySSD / Vedant. 7,000 MB/s speeds for rapid model and dataset loading.
Power Supply         | Deepcool PQ1000M 1000W Gold   | ₹12,000               | EliteHubs / Amazon. Seasonic OEM platform to handle 3090 transient spikes.
Chassis              | Lian Li Lancool 216           | ₹8,500                | EzPz Solutions / MDComputers. Superior airflow for thermal management.
CPU Cooler           | Deepcool AK620 Zero Dark      | ₹5,500                | Amazon / Vedant. Robust air cooling for sustained workstation loads.
Total Estimated Cost |                               | ₹1,87,500             | Remains ~₹12,500 below the ₹2,00,000 limit.

Note: The remaining buffer of ~₹12,500 serves as a contingency for shipping costs, potential price fluctuations in the used GPU market, or the addition of a secondary 4TB HDD for cold storage of medical archives.

6. Technical Implementation and System Optimization

Hardware procurement is only the first phase. The viability of this workstation relies heavily on specific software configurations to harmonize the split-memory architecture.

6.1. Optimizing 120B LLM Inference

Running a model of this magnitude on "prosumer" hardware requires the use of llama.cpp or Ollama, which are optimized for hybrid CPU-GPU inference.

  1. Quantization Strategy: Users must utilize the GGUF format. Specifically, the gpt-oss-120b-Q4_K_M.gguf quantization is recommended. This file size is approximately 72GB.
  2. Layer Offloading Configuration:
    • The RTX 3090 provides 24GB of VRAM. After accounting for OS overhead (approx. 600MB-1GB on a headless Linux server, or 2GB on Windows), roughly 22GB is available for the model.
    • Configuration: Using the --n-gpu-layers flag in llama.cpp, users should offload approximately 30 to 35 layers to the GPU (see the sketch after this list).
    • The CPU's Role: The remaining ~80 layers will reside in the 96GB system RAM. The Ryzen 9 7900 will process these layers using its AVX-512 instructions.
  3. Performance Expectations:
    • Prefill (Prompt Processing): This phase is parallelizable and will benefit from the GPU's initial ingest, offering reasonable speed.
    • Decode (Token Generation): This phase is memory-bandwidth bound. Since a significant portion of the model is in system RAM, the generation speed will be limited by the DDR5 bandwidth (~60 GB/s). Users should expect a generation speed of 2 to 4 tokens per second. While slower than a pure GPU setup, this is fully functional for POC testing, chain-of-thought verification, and automated agentic workflows.[5, 6]
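
As a concrete illustration of the configuration described in this list, the following minimal sketch uses the llama-cpp-python bindings; the model path, layer count, and thread count are assumptions to be tuned against observed VRAM usage (nvidia-smi) on the actual system.

```python
from llama_cpp import Llama   # pip install llama-cpp-python (built with CUDA support)

# Hybrid CPU/GPU loading of the quantized 120B model. The path and the layer
# count below are placeholders; reduce n_gpu_layers if VRAM usage (check with
# nvidia-smi) approaches the 24 GB limit.
llm = Llama(
    model_path="/models/gpt-oss-120b-Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=32,   # ~30-35 layers fit alongside the KV cache on a 3090
    n_ctx=4096,        # modest context keeps KV-cache memory usage predictable
    n_threads=12,      # physical cores of the Ryzen 9 7900
)

out = llm("Summarize the key findings of this radiology report:", max_tokens=128)
print(out["choices"][0]["text"])
```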

6.2. Medical Imaging Stack Configuration

For medical image analysis, the software stack must be configured to prioritize the GPU.

  1. CUDA Toolkit Compatibility: Ensure the installed NVIDIA drivers and CUDA Toolkit version match the requirements of the specific MONAI release (e.g., CUDA 12.x for MONAI v1.3+).
  2. 3D Slicer Configuration:
    • In 3D Slicer settings, ensure "Volume Rendering" is set to use the GPU.
    • The 24GB VRAM allows for the loading of multiple high-resolution series simultaneously, a capability that 8GB or 12GB cards lack.
  3. MONAI Inference:
    • For segmentation tasks on large volumes (e.g., 512x512x512), use Sliding Window Inference (a standard MONAI feature). This technique breaks the large volume into smaller chunks that fit within the GPU memory, processes them, and stitches the results back together. The RTX 3090's capacity allows for larger window sizes, reducing the number of "stitches" and speeding up the overall process.[10]
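
A minimal sketch of this pattern with MONAI's sliding-window inferer is shown below; the network definition, window size, and batch size are illustrative assumptions rather than a validated clinical pipeline.

```python
import torch
from monai.inferers import sliding_window_inference
from monai.networks.nets import UNet

device = torch.device("cuda")

# Illustrative 3D UNet; in practice a trained segmentation checkpoint
# (e.g., a VISTA-3D or task-specific model) would be loaded here.
model = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    channels=(16, 32, 64, 128, 256),
    strides=(2, 2, 2, 2),
).to(device).eval()

# A full-resolution CT volume: (batch, channel, depth, height, width).
volume = torch.randn(1, 1, 512, 512, 512, device=device)

with torch.no_grad():
    # Larger roi_size values mean fewer windows to stitch; 24 GB of VRAM
    # permits noticeably larger windows than 8-12 GB cards.
    seg = sliding_window_inference(
        inputs=volume,
        roi_size=(192, 192, 192),
        sw_batch_size=2,
        predictor=model,
        overlap=0.25,
    )

print(seg.shape)   # torch.Size([1, 2, 512, 512, 512])
```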

6.3. Operating System Recommendation

Linux (Ubuntu 22.04/24.04 LTS) is strongly recommended over Windows 11.

7. Future Expansion Strategy (The Upgrade Path)

The chosen architecture is not a dead end; it is a foundation for a Phase 2 workstation.

  1. Dual GPU Expansion:
    • The ASUS ProArt B650-Creator and the 1000W PSU are selected specifically to accommodate a second RTX 3090 in the future.
    • Installation: The second card can be slotted into the secondary PCIe x16 slot. The motherboard will automatically bifurcate the bandwidth to x8/x8. While x8 is half the bandwidth of x16, for LLM inference (which is memory capacity bound) and many medical imaging tasks, the performance penalty is negligible compared to the gain of doubling VRAM to 48GB.
    • Impact: With 48GB of VRAM, significantly more layers of the 120B model can be offloaded to the GPU, potentially pushing generation speeds toward 8-10 tokens per second (a dual-GPU loading sketch follows this list).
  2. CPU Upgradability:
    • The AM5 socket ensures that in 2-3 years, the user can swap the Ryzen 9 7900 for a Ryzen 9000 or later series processor to gain IPC improvements, without needing to replace the RAM or motherboard.
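
To illustrate how the second card would be used once installed, the sketch below again assumes the llama-cpp-python bindings; the layer count and split ratio are illustrative values for two identical 24 GB cards.

```python
from llama_cpp import Llama

# Hypothetical Phase-2 configuration with two RTX 3090s (48 GB combined).
# tensor_split distributes the GPU-resident layers across the cards; an even
# split suits two identical 24 GB GPUs. All values are illustrative.
llm = Llama(
    model_path="/models/gpt-oss-120b-Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=70,          # roughly double the single-card layer count
    tensor_split=[0.5, 0.5],  # share offloaded layers evenly between GPU 0 and GPU 1
    n_ctx=4096,
)
```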

8. Procurement Risks and Mitigation in India

Sourcing used hardware in India requires a strategic approach to mitigate risk.

  1. Vendor Verification: When dealing with sources like Suraj Technology or Zoukart sellers:
    • Mandatory Testing: Demand a video call or a recorded video showing the specific GPU serial number running a FurMark stress test for at least 15 minutes. Monitor the "Hotspot Temperature"—on a used 3090, this should not exceed 105°C (thermal throttling limit). Ideally, it should stay under 95°C.
    • Benchmark Validation: Request a 3DMark Time Spy run. A healthy RTX 3090 should score approximately 19,000 to 20,000 graphics points. A significantly lower score indicates thermal throttling or a degraded card (former mining card).
  2. Warranty & Returns:
    • Prioritize sellers who offer a "testing warranty" (typically 7 to 30 days).
    • Use payment methods that offer some protection (e.g., the "Admin Method" on Zoukart/Facebook groups, where a trusted admin holds the money).
    • Verify if the card carries any remaining manufacturer warranty. Brands like Zotac are sometimes more lenient with warranty transfers if the original bill is provided, whereas MSI/Gigabyte often strictly follow the serial number and purchase date.[17, 20]

9. Conclusion

This report delineates a workstation architecture that defies conventional "balanced build" logic to satisfy an extreme set of requirements within a constrained budget. By strategically selecting a used NVIDIA RTX 3090, the system secures the 24GB of VRAM and 936 GB/s of bandwidth that are non-negotiable for medical imaging—capabilities that similarly priced new GPUs such as the RTX 4060 Ti fail to deliver. By coupling this with an AM5 Ryzen 9 7900 and 96GB of RAM, the system provides the massive memory buffer required to execute 120B LLMs via hybrid inference, leveraging AVX-512 for acceptable performance.

This configuration is not merely a collection of parts; it is a calculated integration designed to punch far above its weight class. It delivers a functional, high-performance environment for 120B model research and medical AI today, while embedding a clear, hardware-supported pathway to dual-GPU workstation performance in the future.