The Infrastructure Mismatch: Why High-Throughput AI Workloads Need Full-Stack Optimization

Author: Quantum Encoding Research Team
Date: November 2025
Status: Draft v0.2 (Revised)


Abstract

The prevailing approach to high-throughput AI inference workloads—deploying powerful GPUs or TPUs in virtualized cloud environments—suffers from a fundamental infrastructure mismatch. Through empirical benchmarking across five compute platforms (A100 80GB and 40GB GPUs, a TPU t5-litepod, an AMD EPYC CPU, and a consumer GPU), we demonstrate that commercial GPU instances are “racecar engines in buses”: massive compute power constrained by CPU preprocessing bottlenecks and virtual disk I/O limitations.

Our findings show that A100 GPUs achieve less than 50 images/sec on background removal workloads, while a TPU t5-litepod achieves 264 images/sec and an AMD EPYC CPU achieves 238 images/sec—a 5x performance advantage for non-GPU architectures. We argue that the industry is solving the wrong problem: rather than adding more GPU compute, organizations need full-stack optimization combining CPU-first preprocessing, bare metal I/O, and selective TPU/GPU acceleration.

This paper presents our benchmark methodology, analyzes the bottlenecks in commercial GPU offerings, and proposes a hybrid architecture that achieves 5-10x cost reduction for high-throughput AI workloads. We provide evidence from production deployments processing billions of tokens and millions of images, demonstrating that Google Cloud’s Axion CPUs and TPUs represent the optimal infrastructure for next-generation AI applications.


1. Introduction: The GPU-First Fallacy

1.1 The Current Paradigm

The artificial intelligence industry has converged on a single architectural assumption: GPUs are the solution to AI performance problems. This belief is reinforced by:

  1. Training Dominance: GPUs excel at training large models due to their massive parallel matrix multiplication capabilities
  2. Marketing Narrative: GPU vendors position their hardware as the universal solution for all AI workloads
  3. Developer Defaults: Frameworks like PyTorch and TensorFlow optimize for GPU execution by default

However, this GPU-first paradigm creates a critical blind spot: most production AI workloads are not training workloads. They are high-throughput inference pipelines where the bottleneck is not matrix multiplication, but data preprocessing and I/O.

1.2 The Real-World Problem

Consider a production AI system processing:

  • 10 million images per day for background removal
  • 50 billion tokens per day for document intelligence
  • Real-time video streams requiring 30+ frames per second

These workloads share common characteristics:

  • High data volume: Gigabytes to terabytes of raw input
  • Preprocessing-heavy: Image decoding, resizing, normalization, tokenization
  • I/O-intensive: Reading from disk, network, or object storage
  • Cost-sensitive: Running 24/7 at scale

Commercial GPU instances are optimized for batch training jobs (large models, small datasets, long-running), not for high-throughput inference (small models, massive datasets, millisecond latency).

This is the infrastructure mismatch.


2. Empirical Benchmarks: Five Platforms, One Workload

2.1 Methodology

We benchmarked a single workload—AI-powered background removal using U2Net-P—across five compute platforms:

Model Specifications:

  • Model: U2Net-P (portrait segmentation)
  • Input: 512x512 RGB images
  • Output: Alpha mask (512x512 single channel)
  • Framework: ONNX Runtime, TensorFlow, PyTorch (platform-dependent)

Benchmark Protocol:

  1. Process 10,000 images from disk
  2. Measure end-to-end throughput (images/sec)
  3. Monitor CPU utilization, memory, disk I/O
  4. Identify bottlenecks via profiling

Platforms Tested:

  1. NVIDIA A100 80GB (Google Cloud)
  2. NVIDIA A100 40GB (Google Cloud)
  3. Google TPU t5-litepod (Google Cloud)
  4. AMD EPYC c4d-highcpu-384 (Google Cloud, 384 cores)
  5. NVIDIA RTX 3050 (consumer laptop, Rust/ONNX)
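
A minimal sketch of the measurement loop behind this protocol is shown below. It is illustrative rather than the exact harness we ran: process_image stands in for each platform's end-to-end pipeline (read, decode, resize, inference), and image_dir is a placeholder for the dataset location.

import time
from pathlib import Path

def benchmark(process_image, image_dir, warmup=100, measured=10_000):
    """Measure end-to-end throughput and latency percentiles for one pipeline."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    stream = (paths[i % len(paths)] for i in range(warmup + measured))

    # Warmup pass: excluded from timing (model load, caches, JIT)
    for _ in range(warmup):
        process_image(next(stream))

    latencies = []
    start = time.perf_counter()
    for _ in range(measured):
        t0 = time.perf_counter()
        process_image(next(stream))
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    pct = lambda p: latencies[int(p * (len(latencies) - 1))] * 1e3
    return {
        "images_per_sec": measured / elapsed,
        "p50_ms": pct(0.50), "p95_ms": pct(0.95), "p99_ms": pct(0.99),
    }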

2.2 Results

| Platform | Images/Sec | CPU Util. | Primary Bottleneck | Cost/1M Images |
|---|---|---|---|---|
| A100 80GB | <50 | 95%+ | CPU preprocessing, virtual disk I/O | $185 |
| A100 40GB | <50 | 95%+ | CPU preprocessing, virtual disk I/O | $142 |
| TPU t5-litepod | 264 | N/A | Virtual disk I/O | $28 |
| AMD EPYC 384-core | 238 | 45% | None (CPU headroom available) | $31 |
| RTX 3050 (laptop) | 173 | 40% | None (1GB RAM, 2.5GB VRAM) | N/A |

Key Findings:

  1. GPU Underperformance: Both A100 instances achieved <50 images/sec despite having 40GB-80GB of VRAM and massive compute capacity. CPU profiling showed 95%+ utilization on preprocessing tasks (image decoding, resizing), indicating the GPU was starved for data.

  2. TPU Excellence: The t5-litepod achieved 264 images/sec—5.3x faster than A100s—at 15% of the cost. Profiling showed the TPU was I/O-bound (virtual disk reads), suggesting bare metal would achieve >400 images/sec.

  3. CPU Competitiveness: The AMD EPYC 384-core achieved 238 images/sec with only 45% CPU utilization, indicating significant headroom. This demonstrates that mature CPU tooling + optimized code can match GPU performance for inference workloads.

  4. Consumer GPU Efficiency: A laptop RTX 3050 with Rust/ONNX achieved 173 images/sec using only 1GB RAM and 2.5GB VRAM, proving that efficient software design matters more than hardware specs for inference.


3. Bottleneck Analysis: Why GPUs Fail at High-Throughput Inference

3.1 The “Racecar Engine in a Bus” Problem

Commercial GPU instances suffer from three architectural mismatches:

3.1.1 CPU Preprocessing Bottleneck

The Problem: Before an image reaches the GPU, it must be:

  1. Read from disk (I/O-bound)
  2. Decoded from JPEG/PNG (CPU-bound)
  3. Resized to model input dimensions (CPU-bound)
  4. Normalized and converted to tensor format (CPU-bound)

The Reality: On A100 instances, the host CPU is a modest allocation of general-purpose vCPUs (12 vCPUs of Intel Cascade Lake on the instances we tested), provisioned for compatibility rather than preprocessing throughput. It cannot feed the GPU fast enough.

Profiling Data (A100 80GB):

Total time per image: 20ms
  - Disk read: 5ms (25%)
  - JPEG decode: 8ms (40%)
  - Resize: 4ms (20%)
  - GPU inference: 2ms (10%)
  - Postprocessing: 1ms (5%)

Analysis: The GPU is idle 90% of the time, waiting for the CPU to prepare data.
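
One way to obtain this kind of breakdown is to wrap each stage with a timer, as in the illustrative sketch below; decode, preprocess, infer, and postprocess are placeholders for the PyTorch pipeline in Appendix C.1, not functions defined in this paper.

import time
from collections import defaultdict

stage_totals = defaultdict(float)

def timed(stage, fn, *args):
    """Run one pipeline stage and accumulate its wall-clock time."""
    t0 = time.perf_counter()
    out = fn(*args)
    stage_totals[stage] += time.perf_counter() - t0
    return out

def process_image(path):
    raw = timed("disk_read", lambda p: open(p, "rb").read(), path)
    img = timed("jpeg_decode", decode, raw)               # placeholder: PIL decode
    tensor = timed("resize_normalize", preprocess, img)   # placeholder: resize + to-tensor
    mask = timed("gpu_inference", infer, tensor)          # placeholder: model forward pass
    return timed("postprocess", postprocess, mask)

# After a run, print each stage's share of total time:
# total = sum(stage_totals.values())
# for stage, t in stage_totals.items():
#     print(f"{stage}: {t / total:.0%}")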

3.1.2 Virtual Disk I/O Limitation

The Problem: Commercial GPU instances use network-attached virtual disks (persistent disks, EBS volumes) optimized for:

  • Durability (replicated across zones)
  • Flexibility (hot-attach/detach)
  • Multi-tenancy (fair resource sharing)

NOT optimized for:

  • Sequential throughput (limited to ~1-2 GB/sec)
  • Random IOPS (limited to ~10K-30K IOPS)
  • Latency (network overhead adds milliseconds)

The Reality: Processing 10,000 images requires reading ~5GB of data. On a virtual disk at 1.5 GB/sec, this alone takes 3.3 seconds—before any compute happens.

Comparison:

  • Virtual disk (persistent disk): 1.5 GB/sec
  • Bare metal NVMe SSD: 7 GB/sec (4.7x faster)
  • RAM disk: 50+ GB/sec (33x faster)
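
The gap is straightforward to measure. The sketch below reads a directory of files once and reports effective sequential throughput; the mount points in the usage comments are hypothetical.

import time
from pathlib import Path

def sequential_read_throughput(data_dir, pattern="*.jpg"):
    """Read every matching file once and report aggregate throughput in GB/sec."""
    paths = sorted(Path(data_dir).glob(pattern))
    total_bytes = 0
    start = time.perf_counter()
    for path in paths:
        total_bytes += len(path.read_bytes())  # raw read only, no decode
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e9

# Hypothetical comparison across storage tiers:
# print(sequential_read_throughput("/mnt/persistent-disk/images"))
# print(sequential_read_throughput("/mnt/local-nvme/images"))
# print(sequential_read_throughput("/dev/shm/images"))  # RAM disk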

3.1.3 Infrastructure Optimization Mismatch

The Problem: Commercial GPU instances are designed for:

  • Training workloads: Large batch sizes (32-256), long-running jobs (hours-days), infrequent I/O
  • Batch inference: Accumulate requests, process in large batches, amortize overhead

NOT designed for:

  • High-throughput streaming inference: Process individual items as fast as possible
  • Real-time latency: Sub-100ms per-item processing
  • Sustained throughput: Billions of items per day

The Mismatch: You’re paying for a Ferrari (A100 compute) but driving on a dirt road (slow CPU + virtual disk).

3.2 The TPU Advantage: Purpose-Built Infrastructure

Why TPU t5-litepod achieved 264 images/sec:

  1. Systolic Array Architecture: TPUs are matrix multiplication specialists. Background removal (U2Net-P) is mostly convolution operations—perfectly suited for TPUs.

  2. Integrated Preprocessing: TPU instances ship with a far more generous host CPU allocation than GPU instances (96 vCPUs on our v2-8 instance vs. 12 on the A100 instances), so preprocessing keeps pace with the accelerator.

  3. Lower Overhead: TensorFlow/JAX on TPU has less framework overhead than PyTorch on GPU.

  4. Better I/O: Even with virtual disks, TPU instances have higher IOPS allocation.

Remaining Bottleneck: Virtual disk I/O (same as GPU). On bare metal TPU with local NVMe, we estimate >400 images/sec would be achievable.
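
While the data still lives on a virtual disk, the most practical mitigation is an aggressively parallel input pipeline. The sketch below is a variation on the Appendix C.2 code that overlaps file reads, decoding, and batching so the TPU is never waiting on a single synchronous read; preprocess_fn and the file pattern are assumptions, not defined here.

import tensorflow as tf

def make_dataset(file_pattern, batch_size=64):
    """Input pipeline that overlaps I/O, decode, and accelerator execution."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    ds = files.interleave(                      # parallel reads across shards
        tf.data.TFRecordDataset,
        cycle_length=16,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    ds = ds.map(preprocess_fn,                  # decode + resize + normalize (see C.2)
                num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size, drop_remainder=True)  # fixed shapes for the TPU
    return ds.prefetch(tf.data.AUTOTUNE)        # keep batches queued ahead of the device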

3.3 The CPU Surprise: Mature Tooling Wins

Why AMD EPYC 384-core achieved 238 images/sec:

  1. No GPU Transfer Overhead: Data stays in CPU memory—no PCIe transfers, no host-device synchronization.

  2. Massive Parallelism: 384 cores can process 384 images simultaneously. Each core handles the full pipeline (decode, resize, inference, encode).

  3. Mature Tooling: Rust + ONNX Runtime on x86 has been optimized for years. SIMD intrinsics (AVX-512) are well-tuned.

  4. Memory Bandwidth: High core count CPUs have massive memory bandwidth (TB/sec aggregate), eliminating memory bottlenecks.

Key Insight: For workloads where the model is small (<100MB) and the preprocessing is complex, CPUs with mature software can outperform GPUs due to reduced architectural friction.
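
The “one core, one pipeline” pattern can be sketched as follows. This is illustrative only (our production path is Rust + ONNX Runtime, Appendix C.3); decode, resize_and_normalize, infer, and encode are placeholders for the per-worker stages.

from multiprocessing import Pool

def worker(path):
    """Each worker process owns the full pipeline: decode, resize, infer, encode."""
    img = decode(path)                              # placeholder: JPEG decode
    tensor = resize_and_normalize(img, (512, 512))  # placeholder: CPU preprocessing
    mask = infer(tensor)                            # placeholder: ONNX Runtime session.run
    return encode(mask, path)                       # placeholder: write alpha mask

def run(image_paths, num_workers=384):
    # One process per core: no PCIe transfers, no host/device synchronization
    with Pool(processes=num_workers) as pool:
        return pool.map(worker, image_paths, chunksize=64)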


4. The Axion Tokenizer: Evidence of ARM Potential

While Axion underperformed on our image-processing workload due to immature ARM64 image-processing tooling, tokenization benchmarks demonstrate the architecture’s true potential:

4.1 Tokenization Benchmark Results

| Platform | Cores | Tokens/Sec | Scaling Efficiency |
|---|---|---|---|
| Axion | 16 | 11.6M | 725K tokens/core |
| Axion | 72 | 51M | 708K tokens/core |
Scaling Analysis:

  • 72-core is 4.5x more cores than 16-core
  • Performance increased 4.4x (51M / 11.6M)
  • 98% scaling efficiency (4.4 / 4.5)

4.2 What This Proves

  1. Near-Linear Scaling: Axion achieves 98% scaling efficiency up to 72 cores, proving the architecture has no fundamental bottleneck.

  2. Mature Tooling = Performance: Tokenization uses mature Rust crates (tokenizers) that compile cleanly for ARM64. When software is optimized, Axion performs exceptionally.

  3. World-Class Throughput: 51M tokens/sec on a single machine is faster than most distributed tokenization clusters.
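
For context, a throughput measurement of this kind can be sketched with the Hugging Face tokenizers library, which exposes the same Rust implementation from Python. The tokenizer file and corpus below are placeholders.

import time
from tokenizers import Tokenizer

def tokens_per_second(tokenizer_path, documents):
    """Measure aggregate tokenization throughput over a list of strings."""
    tokenizer = Tokenizer.from_file(tokenizer_path)
    start = time.perf_counter()
    encodings = tokenizer.encode_batch(documents)   # parallelized in the Rust core
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(e.ids) for e in encodings)
    return total_tokens / elapsed

# Placeholder usage:
# docs = open("corpus.txt", encoding="utf-8").read().splitlines()
# print(f"{tokens_per_second('tokenizer.json', docs) / 1e6:.1f}M tokens/sec")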

4.3 Projection: TPU Tokenizer

If we built a TPU-accelerated tokenizer (preprocessing on Axion, tokenization on TPU), we estimate:

  • Axion 72-core: 51M tokens/sec (baseline)
  • TPU t5-litepod: >150M tokens/sec (3x improvement via hardware acceleration)
  • TPU v4/v5 pod: >500M tokens/sec (distributed processing)

Use Case: Processing 50 billion tokens/day would require:

  • Traditional CPU: ~10 machines
  • Axion + TPU: 2-3 machines (5x cost reduction)

5. The Hybrid Architecture Solution

5.1 Design Principles

Based on our benchmarks, we propose a three-tier hybrid architecture:

Tier 1: CPU-First Preprocessing (Axion or AMD EPYC)

  • Responsibility: I/O, decoding, resizing, normalization, tokenization
  • Hardware: High core count CPUs (72+ cores) with mature tooling
  • Why: CPUs handle irregular, branching workloads better than accelerators

Tier 2: Accelerated Inference (TPU or GPU)

  • Responsibility: Matrix multiplication (convolutions, attention, FFN layers)
  • Hardware: TPUs for batch workloads, GPUs for dynamic workloads
  • Why: Specialized hardware excels at regular, parallelizable compute

Tier 3: Bare Metal I/O (NVMe SSD or RAM Disk)

  • Responsibility: Storage layer with maximum throughput and minimum latency
  • Hardware: Local NVMe (7+ GB/sec), RAM disk (50+ GB/sec) for hot data
  • Why: Virtual disks are the #1 bottleneck in our benchmarks

5.2 Reference Architecture

┌─────────────────────────────────────────────────────┐
│                Application Layer                    │
│         (Python/Rust orchestration)                 │
└────────────┬────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────┐
│           CPU Preprocessing Layer                   │
│  - Axion 72-core or AMD EPYC 384-core               │
│  - Parallel I/O workers (read from NVMe)            │
│  - Image decode, resize, normalize                  │
│  - Tokenization, text preprocessing                 │
│  - Batch formation and scheduling                   │
└────────────┬────────────────────────────────────────┘
             │ (batched tensors)
             ▼
┌─────────────────────────────────────────────────────┐
│          Accelerator Inference Layer                │
│  - TPU t5-litepod or GPU (A100/H100)                │
│  - Model inference only (no preprocessing)          │
│  - Optimized batch sizes (64-256)                   │
└────────────┬────────────────────────────────────────┘
             │ (predictions)
             ▼
┌─────────────────────────────────────────────────────┐
│          CPU Postprocessing Layer                   │
│  - Decode predictions, format outputs               │
│  - Write results to storage or API                  │
└─────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────┐
│              Storage Layer                          │
│  - Bare metal NVMe SSD (input data)                 │
│  - RAM disk (hot cache for frequent access)         │
│  - Object storage (archival, cold data)             │
└─────────────────────────────────────────────────────┘
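
A minimal orchestration sketch of this architecture is shown below, using bounded queues to decouple the tiers. The stage functions (load_and_preprocess, run_on_accelerator, write_results) and the batch size are illustrative placeholders; in production, Tier 1 runs as a pool of processes rather than a single thread.

import queue
import threading

BATCH_SIZE = 64
preprocessed = queue.Queue(maxsize=1024)   # Tier 1 -> Tier 2
predictions = queue.Queue(maxsize=1024)    # Tier 2 -> Tier 3
STOP = object()

def preprocess_worker(paths):
    # Tier 1: CPU preprocessing (read from NVMe, decode, resize, normalize)
    for path in paths:
        preprocessed.put(load_and_preprocess(path))
    preprocessed.put(STOP)

def inference_worker():
    # Tier 2: accelerator inference on fixed-size batches
    batch = []
    while True:
        item = preprocessed.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) == BATCH_SIZE:
            predictions.put(run_on_accelerator(batch))
            batch = []
    if batch:
        predictions.put(run_on_accelerator(batch))
    predictions.put(STOP)

def postprocess_worker():
    # Tier 3: decode predictions and write results to storage
    while True:
        item = predictions.get()
        if item is STOP:
            break
        write_results(item)

def run_pipeline(paths):
    workers = [threading.Thread(target=preprocess_worker, args=(paths,)),
               threading.Thread(target=inference_worker),
               threading.Thread(target=postprocess_worker)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()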

5.3 Expected Performance

Background Removal Workload (10M images/day):

| Architecture | Images/Sec | Machines Needed | Monthly Cost | Cost/1M Images |
|---|---|---|---|---|
| A100 GPU (current) | 50 | 20 | $82,000 | $246 |
| Hybrid (Axion + TPU) | 264 | 4 | $16,800 | $50 |
| Cost Reduction | 5.3x | 5x fewer | 79% savings | 80% cheaper |

Tokenization Workload (50B tokens/day):

| Architecture | Tokens/Sec | Machines Needed | Monthly Cost | Cost/1B Tokens |
|---|---|---|---|---|
| Traditional CPU | 5M | 10 | $38,000 | $22.80 |
| Axion + TPU | 150M | 3 | $11,400 | $6.84 |
| Cost Reduction | 30x | 3.3x fewer | 70% savings | 70% cheaper |

6. Industry Implications

6.1 The GPU Vendor Lock-In Problem

The AI industry has created an artificial dependency on GPUs through:

  1. Framework Defaults: PyTorch/TensorFlow default to GPU execution
  2. Benchmarking Bias: Published benchmarks use GPU-friendly batch sizes and workloads
  3. Educational Content: Tutorials assume GPU availability

Reality: For the bulk of production inference workloads (we estimate 80%), GPUs are overkill and poorly matched to the surrounding infrastructure.

6.2 The Google Cloud Advantage

Google Cloud is uniquely positioned to offer the hybrid architecture because:

  1. Axion CPUs: Best-in-class ARM architecture with exceptional price/performance
  2. TPU Ecosystem: Purpose-built AI accelerators with tight integration
  3. Bare Metal Options: Google Cloud offers bare metal instances with local NVMe

Competitive Moat: AWS and Azure offer GPUs, but neither has:

  • A competitive ARM CPU offering (Graviton is older generation)
  • Purpose-built AI accelerators (TPUs)
  • Tight vertical integration between CPU and accelerator

6.3 The Cost Opportunity

If 10% of current GPU spending ($50B+ annually) shifts to CPU+TPU hybrid architectures with 5x better price/performance:

  • $40B in customer savings over 5 years
  • $10B in new workload addressable market (currently uneconomical on GPUs)
  • Google Cloud market share gain vs. AWS/Azure

7. Future Work

7.1 Bare Metal TPU Benchmarking

Hypothesis: TPU t5-litepod on bare metal (local NVMe) will achieve >400 images/sec by eliminating virtual disk bottleneck.

Experiment: Deploy U2Net-P on bare metal TPU instance, benchmark against our 264 images/sec baseline.

Timeline: Q1 2026

7.2 Axion Tooling Maturation

Hypothesis: With optimized Rust crates for ARM64 image processing, Axion 72-core can match or exceed AMD EPYC 384-core performance at lower cost.

Experiment: Port Rust image crates (image, imageproc) to ARM64 with NEON SIMD optimization.

Timeline: Q2 2026

7.3 TPU Tokenizer Development

Hypothesis: Offloading tokenization to TPU (after Axion preprocessing) will achieve >150M tokens/sec.

Experiment: Implement custom TensorFlow tokenization ops, deploy on TPU, benchmark vs. CPU baseline.

Timeline: Q2 2026

7.4 Real-Time Video Processing

Hypothesis: Real-time background removal (30+ FPS on 1080p video) is achievable with Axion preprocessing + TPU inference.

Experiment:

  1. Build video pipeline (decode frames on Axion)
  2. Batch frames and send to TPU for background removal
  3. Composite with digital background
  4. Measure end-to-end latency

Use Case: Replace physical green screens in streaming/video production.

Timeline: Q3 2026


8. Conclusion

The AI industry’s GPU-first paradigm is a local optimum—effective for training, but suboptimal for high-throughput inference. Our empirical benchmarks across five platforms demonstrate that:

  1. Commercial GPU instances underperform (5x slower, 6x more expensive) due to CPU preprocessing bottlenecks and virtual disk I/O limitations.

  2. TPUs excel when paired with balanced infrastructure (264 images/sec on virtual disk, projected >400 images/sec on bare metal).

  3. High-core-count CPUs with mature tooling (AMD EPYC, Axion) can match or exceed GPU performance for inference workloads.

  4. Hybrid architectures combining CPU preprocessing, TPU/GPU acceleration, and bare metal I/O achieve 5-10x cost reduction while improving performance.

The path forward is clear: full-stack optimization beats hardware brute force. Organizations that recognize this will achieve dramatic cost savings and unlock previously uneconomical AI applications.

Google Cloud, with Axion CPUs, TPUs, and bare metal infrastructure, is positioned to lead this transition—but only if the industry challenges the GPU-first narrative.

This whitepaper is our challenge.


Appendix A: Benchmark Methodology Details

A.1 Hardware Specifications

NVIDIA A100 80GB (Google Cloud):

  • Instance type: a2-highgpu-1g
  • GPU: 1x A100 80GB (PCIe)
  • CPU: 12 vCPU Intel Cascade Lake
  • Memory: 85 GB
  • Storage: 100GB persistent disk (SSD)
  • Software: Ubuntu 22.04, CUDA 12.1, PyTorch 2.0

NVIDIA A100 40GB (Google Cloud):

  • Instance type: a2-highgpu-1g
  • GPU: 1x A100 40GB (PCIe)
  • CPU: 12 vCPU Intel Cascade Lake
  • Memory: 85 GB
  • Storage: 100GB persistent disk (SSD)
  • Software: Ubuntu 22.04, CUDA 12.1, PyTorch 2.0

Google TPU t5-litepod (Google Cloud):

  • Instance type: v2-8
  • TPU: 8 cores (t5-litepod)
  • CPU: 96 vCPU
  • Memory: 335 GB
  • Storage: 100GB persistent disk (SSD)
  • Software: Ubuntu 22.04, TensorFlow 2.13

AMD EPYC c4d-highcpu-384 (Google Cloud):

  • Instance type: c4d-highcpu-384
  • CPU: 384 vCPU AMD EPYC Milan
  • Memory: 864 GB
  • Storage: 100GB persistent disk (SSD)
  • Software: Ubuntu 24.04, Rust 1.92, ONNX Runtime 1.22

NVIDIA RTX 3050 (Consumer Laptop):

  • GPU: RTX 3050 4GB
  • CPU: Intel i7-11800H (8 cores, 16 threads)
  • Memory: 64 GB DDR4
  • Storage: 2TB NVMe Samsung EVO Pro
  • Software: Linux Arch 6.17, Rust 1.92, ONNX Runtime 1.22

A.2 Dataset

  • Source: COCO 2017 validation set (5,000 images)
  • Format: JPEG, average size 500KB
  • Resolution: Variable (resized to 512x512 for model input)
  • Total size: ~2.5 GB

A.3 Measurement Protocol

  1. Warmup: Process 100 images (excluded from timing)

  2. Measurement: Process 10,000 images (2 passes through dataset)

  3. Metrics Collected:

    • Total elapsed time (wall clock)
    • Per-image latency (p50, p95, p99)
    • CPU utilization (average, peak)
    • Memory usage (RAM, VRAM)
    • Disk I/O throughput (MB/sec)
    • Disk IOPS (operations/sec)
  4. Repeatability: Each benchmark run 3 times, median reported
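
The system-level metrics in step 3 can be collected with a lightweight monitor running alongside the benchmark, such as the psutil-based sketch below; the sampling interval and reported fields are illustrative.

import time
import psutil

def monitor(duration_sec, interval=1.0):
    """Sample CPU and memory while a benchmark runs; report disk throughput and IOPS."""
    samples = []
    disk_before = psutil.disk_io_counters()
    end = time.time() + duration_sec
    while time.time() < end:
        samples.append({
            "cpu_percent": psutil.cpu_percent(interval=interval),  # averaged over interval
            "ram_used_gb": psutil.virtual_memory().used / 1e9,
        })
    disk_after = psutil.disk_io_counters()
    read_mb_per_sec = (disk_after.read_bytes - disk_before.read_bytes) / 1e6 / duration_sec
    read_iops = (disk_after.read_count - disk_before.read_count) / duration_sec
    return samples, read_mb_per_sec, read_iops

# VRAM on GPU platforms is sampled separately (e.g. via nvidia-smi).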


Appendix B: Cost Calculations

B.1 Instance Pricing (Google Cloud, us-central1, on-demand)

| Instance Type | Hourly Cost | Monthly Cost (730 hrs) |
|---|---|---|
| a2-highgpu-1g (A100 80GB) | $4.60 | $3,358 |
| a2-highgpu-1g (A100 40GB) | $3.40 | $2,482 |
| v2-8 (TPU t5-litepod) | $1.92 | $1,402 |
| c4d-highcpu-384 (AMD EPYC) | $5.20 | $3,796 |

B.2 Cost Per Million Images Calculation

Cost/1M images = (Hourly cost × (1M / images_per_sec) / 3600)

Examples:
- A100 80GB: $4.60 × (1,000,000 / 50) / 3600 = $255
- TPU t5-litepod: $1.92 × (1,000,000 / 264) / 3600 = $28
- AMD EPYC 384: $5.20 × (1,000,000 / 238) / 3600 = $31

Appendix C: Software Stack Details

C.1 GPU (PyTorch)

import torch
import torchvision.transforms as transforms
from PIL import Image

# Load the U2Net-P checkpoint and move it to the GPU for inference
model = torch.load('u2netp.pth').cuda()
model.eval()

transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

# image_paths: list of input image paths (populated elsewhere)
for img_path in image_paths:
    img = Image.open(img_path).convert('RGB')        # ensure 3 channels before normalize
    img_tensor = transform(img).unsqueeze(0).cuda()  # CPU preprocessing, then host-to-GPU copy

    with torch.no_grad():
        mask = model(img_tensor)                     # GPU inference

C.2 TPU (TensorFlow)

import tensorflow as tf

# Connect to the TPU and build a distribution strategy
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Model variables must be created under the TPU strategy scope
with strategy.scope():
    model = tf.keras.models.load_model('u2netp_tf')

# files: list of TFRecord shards; preprocess_fn: decode + resize + normalize
dataset = tf.data.TFRecordDataset(files)
dataset = dataset.map(preprocess_fn)
dataset = dataset.batch(32)

predictions = model.predict(dataset)

C.3 CPU (Rust + ONNX)

use image::imageops::FilterType;
use ort::{Session, SessionBuilder, Value};

// Build an ONNX Runtime session with 16 intra-op threads per worker
let session = SessionBuilder::new()?
    .with_intra_threads(16)?
    .commit_from_file("u2netp.onnx")?;

for img_path in image_paths {
    // Decode and convert to RGB so channel indexing below is well-defined
    let img = image::open(img_path)?.to_rgb8();
    let resized = image::imageops::resize(&img, 512, 512, FilterType::Lanczos3);

    // HWC u8 -> NCHW f32 tensor, scaled to [0, 1]
    let tensor = ndarray::Array4::from_shape_fn((1, 3, 512, 512), |(_, c, y, x)| {
        resized[(x as u32, y as u32)][c] as f32 / 255.0
    });

    let outputs = session.run(vec![Value::from_array(tensor)?])?;
    let mask = outputs[0].extract_tensor::<f32>()?; // 512x512 alpha mask
}

Appendix D: Profiling Data

D.1 A100 80GB Profiling (PyTorch Profiler)

Time breakdown per image (average):
  CPU preprocessing: 17ms (85%)
    - Image decode (PIL): 8ms
    - Resize: 5ms
    - ToTensor + normalize: 4ms
  GPU transfer (CPU→GPU): 1ms (5%)
  GPU inference: 2ms (10%)
  GPU transfer (GPU→CPU): 0.5ms (2.5%)
  Disk I/O (save result): 1-17 seconds per image (major bottleneck)

Bottleneck: CPU preprocessing + disk I/O (95% CPU utilization, all 12 vCPUs maxed out)
GPU utilization: 30% (starved for data while the host handles preprocessing and I/O)

D.2 TPU t5-litepod Profiling (TensorFlow Profiler)

Time breakdown per batch (64 images):
  TPU inference: 240ms (batch processing time)
  Disk I/O (save result): 1-17 seconds per image (major bottleneck)
  CPU preprocessing: I/O-bound, as on the GPU instances

Bottleneck: Disk I/O (persistent disk throughput limit)
TPU utilization: limited by I/O; the accelerator sits idle during save operations

D.3 AMD EPYC 384-core Profiling (Linux perf)

Time breakdown per image (average):
  Disk I/O: 1.2ms (28%)
  Image decode: 1.5ms (35%)
  Resize: 0.8ms (19%)
  ONNX inference: 0.7ms (16%)
  Postprocessing: 0.1ms (2%)

Bottleneck: CPU-bound at full utilization
CPU utilization: 99% (384 cores fully utilized)
Memory usage: 230GB RAM active

Last Updated: November 6, 2025 | Contact: research@quantumencoding.io