Be Curious About Your Compute

After facing a blocker related to hardware, I decided to deep dive into it: An Explainer

Piyush Choudhari · September 24, 2025

Hardware Often Takes The Backseat

I've been a software guy throughout my journey, and I've rarely tried to lift the curtain and look at the hardware underneath. The rare times I did were probably during the IoT study sessions in my engineering course.

Most of the blockers I've faced during my AI engineering journey have been software related: broken drivers, unpatched source code, outdated libraries, and of course the classic Python quirks.

But one day I hit a blocker that, unknown to me at the time, was hardware related. At my internship, I was drawing inference from an Automatic1111 server on AWS. The server was equipped with two diffusion models: SDXL (very heavy) and SD1.5 (lighter). Drawing inference from one model caused the next inference from the other model to be very slow. My first reaction was that it was a caching problem.

But after debugging the issue, I learned that this was expected behavior of the Automatic1111 module. It loads one model into VRAM and "keeps it hot" for fast inference; switching to the second model forces an unload and reload, which is why the first request after a switch is slow.
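Here is a minimal sketch of the same effect using the diffusers library rather than Automatic1111 itself; the model IDs are illustrative assumptions, but it shows how the first call after loading a pipeline pays the VRAM load cost while subsequent calls stay "hot", and why swapping models on a small GPU hurts.

```python
import time
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionXLPipeline

def timed(label, fn):
    # crude wall-clock timer around a callable
    start = time.perf_counter()
    result = fn()
    torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.1f}s")
    return result

# Load SD1.5 into VRAM ("keep it hot")
sd15 = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

timed("SD1.5 first call (cold)", lambda: sd15("a lighthouse at dusk"))
timed("SD1.5 second call (hot)", lambda: sd15("a lighthouse at dusk"))

# SDXL now competes for the same VRAM; on a small GPU you have to drop
# SD1.5 first, and the next SD1.5 request pays the reload cost again.
del sd15
torch.cuda.empty_cache()

sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
timed("SDXL first call (cold)", lambda: sdxl("a lighthouse at dusk"))
```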


That day I learned that understanding how software and hardware interact goes a long way.

"Not thinking about the hardware" is something I observe a lot from peers and engineers. So, I had the idea to write this blog.


Types Of Available Hardware

CPU vs GPU vs TPU

| Aspect | CPU | GPU | TPU |
| --- | --- | --- | --- |
| Role | General-purpose processor, brain of the computer | Originally for graphics, now massively parallel compute | Custom ASIC by Google for tensor operations & deep learning |
| Cores | Few powerful cores (2–64, more in servers) | Thousands of lightweight cores (A100 ~7,000+) | Systolic arrays with Matrix Multiply Units (MXUs) |
| Execution Model | Low-latency, sequential, strong branch handling | SIMT/SIMD, warp scheduling, high throughput | Data flows across systolic array; highly specialized instructions |
| Memory | Large cache hierarchy, moderate bandwidth | High-bandwidth VRAM (HBM/GDDR), smaller caches, bulk data optimized | HBM tightly coupled with compute; optimized for DL precision formats |
| Strengths | Versatile, single-thread performance, task switching | Matrix/vector math, AI/ML training, rendering, high throughput | Extremely efficient at AI training/inference, high perf-per-watt |
| Weaknesses | Limited parallelism, not efficient for massive matrix ops | Poor at branch-heavy sequential code, higher latency, needs CUDA/OpenCL | Not general-purpose, tied to Google ecosystem, less flexible |

AI, like any computationally expensive workload, requires extremely high throughput: the individual calculations are straightforward (matmuls, gradient averages, dot products, etc.), unlike typical CPU workloads, which involve complex branching logic and demand minimal latency. Hence GPUs and TPUs are extremely popular choices.


GPUs, in particular, offer extremely high memory bandwidth, and unlike TPUs they are easy to program against through the CUDA libraries; PyTorch and TensorFlow both provide high interoperability with them out of the box.
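A quick way to feel this throughput difference is to time the same matrix multiplication on CPU and GPU with PyTorch. A minimal sketch, assuming a CUDA-capable GPU is available (the matrix size is arbitrary):

```python
import time
import torch

N = 4096
a = torch.randn(N, N)
b = torch.randn(N, N)

# CPU matmul: a few powerful cores, limited parallelism
start = time.perf_counter()
torch.matmul(a, b)
cpu_time = time.perf_counter() - start

# GPU matmul: thousands of cores churn through the same work in parallel
a_gpu, b_gpu = a.cuda(), b.cuda()
torch.matmul(a_gpu, b_gpu)          # warm-up (kernel launch, cuBLAS init)
torch.cuda.synchronize()

start = time.perf_counter()
torch.matmul(a_gpu, b_gpu)
torch.cuda.synchronize()            # wait for the async kernel to finish
gpu_time = time.perf_counter() - start

print(f"CPU: {cpu_time:.3f}s  GPU: {gpu_time:.4f}s")
```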


Bottlenecks

bottleneck

It is normal to focus on metrics like TFLOPs, but a system's performance and turnaround time are often limited by how quickly data can be moved and accessed, not just by how fast it can be processed.

  1. Memory capacity (VRAM) and bandwidth are frequently more important than raw compute power. A processor is inefficient if it sits idle waiting for data. VRAM capacity is a key constraint, as it determines whether a large model, such as an LLM, can fit onto a single GPU. For instance, a high-TFLOP consumer GPU like an RTX 4090 with 24 GB of VRAM cannot train models that require the 80 GB or more offered by datacenter GPUs (a rough sizing sketch follows this list). VRAM size also limits the training batch size, affecting efficiency.

  2. When scaling to multiple GPUs for larger models, the interconnect (the communication pathway between GPUs) can become the main bottleneck. Standard interconnects like PCIe have limited bandwidth (around 64 GB/s), which can get saturated when GPUs synchronize gradients, leading to diminishing returns when scaling beyond a few GPUs. In contrast, proprietary technologies like NVIDIA's NVLink provide vastly higher bandwidth (up to 900 GB/s), allowing efficient scaling for training massive foundation models.

  3. The entire system can be constrained by slow data I/O from storage. If data cannot be loaded from disk to the GPUs quickly enough, even the most powerful hardware will sit idle, creating a foundational bottleneck.
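As a back-of-the-envelope check on point 1, here is a rough VRAM estimate for training. The multipliers are common rules of thumb (FP16 weights plus FP32 Adam optimizer state and gradients), not exact figures; activation memory comes on top and varies widely with batch size and architecture.

```python
def training_vram_gb(num_params_billion: float,
                     weight_bytes: int = 2,     # FP16/BF16 weights
                     grad_bytes: int = 2,       # FP16/BF16 gradients
                     optim_bytes: int = 12) -> float:  # Adam: FP32 master weights + 2 moments
    """Rule-of-thumb VRAM for weights, gradients and optimizer state only."""
    params = num_params_billion * 1e9
    total_bytes = params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / 1e9

# A 7B-parameter model already needs ~112 GB before activations,
# which is why a 24 GB RTX 4090 cannot train it without sharding or offloading.
print(f"{training_vram_gb(7):.0f} GB")
```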


Accelerators

Modern AI accelerators are designed around specific philosophies, and each type of accelerator handles different kinds of workloads.

  • NVIDIA H100 focuses on cutting-edge training with its Transformer Engine and FP8 support, demanding extreme bandwidth and power efficiency for massive LLMs.
  • Google TPU v5p uses a systolic array (MXU) for extreme efficiency in large-scale, matrix-heavy workloads, tightly coupled with Google’s distributed infrastructure.
  • AMD MI300 competes by integrating CPU and GPU components into one package, offering a large unified memory pool — attractive for diverse HPC and AI workloads where flexibility and capacity matter.

These differences reflect the architectural trade-offs between general-purpose flexibility (GPU) and specialized efficiency (TPU), with AMD carving a hybrid path.

Accelerator Comparison

| Accelerator | Memory | Bandwidth | Key Use Case |
| --- | --- | --- | --- |
| NVIDIA H100 | 80 GB HBM3 | ~3.0 TB/s | Cutting-edge LLM training, HPC workloads with extreme throughput needs |
| AMD MI300X | 192 GB HBM3 | ~5.3 TB/s | Large-scale AI training, HPC, and very large models (fits datasets in memory) |
| Google TPU v5p | 95 GB HBM2e per chip | ~2.77 TB/s | Large-scale AI training/inference; matrix-heavy workloads in Google ecosystem |
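Before reasoning about datacenter parts, it helps to know what you are actually running on. A minimal sketch for querying the local GPU with standard torch.cuda calls:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1e9
    print(f"GPU: {props.name}, VRAM: {total_gb:.1f} GB, SMs: {props.multi_processor_count}")
    # How much of that VRAM this process has currently allocated
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
else:
    print("No CUDA device visible; running on CPU")
```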

Practical Tips

To optimize AI workloads, we need to focus on data movement too, not just compute speed.

  • Profile your system: The first step is to identify where the bottlenecks are. Is it memory capacity (VRAM), memory bandwidth, interconnect speed between GPUs, or slow data loading from storage?
  • Use mixed precision (AMP): Modern GPUs have specialized Tensor Cores that accelerate FP16/BF16 computations. Using Automatic Mixed Precision (AMP) in frameworks like PyTorch significantly speeds up training by using lower precision for matrix math while maintaining accuracy with FP32 for sensitive operations like loss calculation. This also reduces VRAM usage (see the sketch after this list).
  • Tune batch size and data loaders: VRAM capacity limits your batch size. A larger batch can improve hardware utilization, but memory constraints may force a smaller one. Ensure your data loading pipeline from storage to GPU is not the bottleneck, as even the fastest GPU is wasted if it's waiting for data.
  • Leverage optimized frameworks: For multi-GPU training, use libraries like PyTorch's DistributedDataParallel with the NCCL backend, or advanced frameworks like DeepSpeed, to efficiently manage gradient synchronization, which is often a key bottleneck. These tools are crucial for scaling effectively.
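To make the AMP and data-loading tips concrete, here is a minimal training-loop sketch using the standard PyTorch AMP and DataLoader APIs; the model, dataset, and hyperparameters are placeholders chosen for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 underflow

# Placeholder dataset; tune batch_size to your VRAM, and num_workers/pin_memory
# so the GPU is never waiting on the input pipeline.
dataset = TensorDataset(torch.randn(10_000, 512), torch.randint(0, 10, (10_000,)))
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=4, pin_memory=True)

for x, y in loader:
    x = x.to(device, non_blocking=True)       # overlaps host-to-device copy with compute
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # matmuls run in FP16/BF16 on Tensor Cores
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()             # scaled loss keeps gradients representable
    scaler.step(optimizer)
    scaler.update()
```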

Conclusion

You don't need to be a GPU engineer or a chip designer to understand the nuances of hardware and how it interacts with software. The only requirements are to keep the hardware in mind while designing systems and to Be Curious About Your Compute.