The "CPU Tax" Problem: In traditional architectures, CPUs handle networking and data tasks, leaving expensive GPUs idle. F5 DPUs offload this overhead, creating a "fast lane" that liberates CPUs and unlocks GPU potential for higher utilization and throughput.
View Detailed Function Breakdown
| Function | CPU Overhead | F5 Offload | Notes |
|---|---|---|---|
| KV Cache Management | 15-25% | Full | Grows with context length; biggest gain for large models |
| Network I/O & Protocols | 10-15% | Full | TCP/IP stack, gRPC/REST parsing, response assembly |
| SSL/TLS Processing | 8-12% | Full | Encryption/decryption, certificate validation |
| Memory & DMA Operations | 8-12% | Partial | Buffer management, data movement |
| Request Batching & Scheduling | 5-10% | Partial | Dynamic batching decisions, queue management |
| Load Balancing & Security | 5-10% | Full | Request routing, health checks, WAF |
| Tokenization & Telemetry | 3-8% | None | Pre/post processing, metrics, logging |
| Category | Without F5 | With F5 | Delta |
|---|---|---|---|
| Total TCO | | | |
GPU Cloud Economics Methodology & Formulas
Baseline Power (kW) = GPU Count × GPU Power (W) × (0.25 + 0.75 × Utilization) ÷ 1000
With F5 = Baseline Power + (DPU Count × 50 W ÷ 1000)
Example: 5,000 GPUs × 700 W × 0.78 = 2,730 kW baseline
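As a sanity check, the power formulas above can be sketched in Python (function and parameter names are illustrative, not part of the calculator):

```python
def baseline_power_kw(gpu_count: int, gpu_watts: float, utilization: float) -> float:
    """Cluster power in kW: 25% static floor plus 75% scaled by utilization."""
    return gpu_count * gpu_watts * (0.25 + 0.75 * utilization) / 1000

def power_with_f5_kw(baseline_kw: float, dpu_count: int, dpu_watts: float = 50.0) -> float:
    """Baseline power plus DPU overhead (50 W per DPU, per the formula above)."""
    return baseline_kw + dpu_count * dpu_watts / 1000

# The 0.78 load factor in the example corresponds to 0.25 + 0.75 * utilization,
# i.e. roughly 70.7% utilization, on 5,000 GPUs at 700 W each.
print(round(baseline_power_kw(5000, 700, (0.78 - 0.25) / 0.75)))  # 2730
```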
F5 ARR/MW = Benchmark × (1 + Throughput Lift × 0.7 + Efficiency Lift × 0.3). 70% weight on throughput (more billable tokens), 30% on efficiency (lower cost per token).
Incremental ARR = (F5 Power MW × F5 ARR/MW) − (Baseline Power MW × Baseline ARR/MW). Total new revenue capacity unlocked by F5 optimization.
Density Lift % = ((F5 ARR/MW ÷ Baseline ARR/MW) − 1) × 100. How much more revenue you generate per megawatt of power.
GM/MW = Power Savings/MW + Throughput Value/MW. Throughput Value = GPUs/MW × $2/GPU-hr × 8,760 hrs × 75% util × Throughput Gain % × 30%.
Density Increase % ≈ Throughput Improvement %. More tokens/sec per GPU = more concurrent customers on the same hardware. Tenant multiplier: 1.0× → (1 + Throughput %)×.
Util Lift = Throughput Improvement % × 60%. 60% of throughput gains convert to additional billable GPU-hours (capped at 95% effective utilization).
Tokens/$/Watt = Tokens/sec ÷ Power Cost/hr ÷ Watts. Measures operational efficiency: tokens generated per dollar of electricity per watt consumed.
Higher Tokens/$/Watt directly amplifies ARR per MW because you generate more billable output from the same power budget. At datacenter scale, even small efficiency gains compound dramatically.
ARR Impact = Baseline ARR/MW × TPDW Improvement % × Revenue Conversion Factor. Revenue conversion: roughly 40-60% of efficiency gains translate to bottom-line ARR improvement.
• With F5: higher Tokens/$/Watt → more tokens per energy dollar
• Result: each 1% improvement in Tokens/$/Watt ≈ $100M additional ARR capacity at GW scale
Baseline ARR at Scale = Scale (GW) × 1000 × ARR/MW Benchmark
F5 ARR at Scale = Baseline ARR × (1 + Revenue Density Lift %)
GPUs at Scale ≈ 1,400 GPUs per MW (at 700 W per GPU)
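Putting the scale formulas together, a minimal sketch (the benchmark and lift values below are illustrative inputs, not calculator defaults):

```python
def arr_at_scale(scale_gw: float, arr_per_mw: float, density_lift_pct: float):
    """Baseline and F5-optimized ARR for a deployment sized in gigawatts."""
    baseline = scale_gw * 1000 * arr_per_mw          # GW -> MW, times ARR/MW
    with_f5 = baseline * (1 + density_lift_pct / 100)
    return baseline, with_f5

# 1 GW at the $10M/MW benchmark with an assumed 50% revenue density lift
base, f5 = arr_at_scale(1.0, 10e6, 50)
print(base, f5)  # 10000000000.0 15000000000.0
```

At 1 GW and roughly 1,400 GPUs per MW, this deployment would hold on the order of 1.4M GPUs.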
Khosla Ventures AI Infrastructure Research
$10M ARR/MW benchmark for GPU cloud infrastructure. Top performers achieve $15-20M/MW.
What does this mean?
Larger GPU deployments achieve higher ROI due to operational efficiencies that scale non-linearly. This multiplier reflects real-world benefits including:
Impact: A deployment at 2,048 GPUs with a 1.30× scale factor sees 30% higher net benefits than the same configuration at 256 GPUs, directly boosting ROI and NPV and shortening the payback period.
Why throughput % ≠ payback speed: A 57% throughput improvement generates incremental token revenue, but payback depends on the net benefit after subtracting power costs and F5 license fees. See the financial breakdown below for your specific values.
- Annual ROI: Your yearly return on F5 investment. Above 40% is good; above 100% is excellent.
- 3-Year NPV: Total value created over the license term, discounted to today's dollars. Positive = profitable investment.
- Payback Period: Months until F5 costs are recovered. Formula: (Annual F5 Cost ÷ Net Annual Benefit) × 12. Under 12 months indicates strong value.
- IRR: Annualized return rate. Should exceed your hurdle rate (default 12%) to justify investment.
- Green metrics: Investment is profitable and exceeds benchmarks
- Yellow metrics: Marginal returns - consider adjusting configuration
- Red metrics: Negative ROI - try larger models or more complex workloads
- Utilization boost: The core value driver - F5 recovers wasted GPU cycles from CPU bottlenecks
| Metric | Before | With F5 | Improvement |
|---|---|---|---|
| GPU Utilization | 45% | 72% | +60% |
| Throughput (tok/s/GPU) | 450 | 720 | +60% |
| Tokens per Joule | 0.64 | 0.89 | +39% |
| Tokens/$/Watt | 0.00 | 0.00 | +0% |
| TTFT Latency | 150ms | 120ms | -20% |
| Line Item | Annual Amount |
|---|---|
| F5 License CapEx (3-yr amortized) | $1,250,000 |
| Incremental Token Revenue (F5) | $2,100,000 |
| OpEx Savings | $150,000 |
| GPU Capacity Value (20% realization) | $0 |
| Power Cost Delta | +$85,000 |
| Net Annual Benefit | $915,000 |
Important: Even with a high throughput improvement (e.g., 57%), payback depends on the dollar value of that improvement after costs:
- Incremental Token Revenue: Only the additional tokens/sec from F5 (not total revenue)
- GPU Capacity Value: Economic value of freed GPU capacity (tiered realization rate by maturity)
- OpEx Savings: Reduced orchestration, networking, and operational costs
- Minus Power Delta: DPUs consume power, adding to operating costs
- Minus F5 License Cost: The annual subscription fee for F5 DPUs
Example: If F5 costs $100K/year and the net benefit is $73.6K/year, Payback = ($100K ÷ $73.6K) × 12 = 16.3 months
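The payback formula is simple enough to verify directly; a sketch:

```python
def payback_months(annual_f5_cost: float, net_annual_benefit: float) -> float:
    """(Annual F5 Cost / Net Annual Benefit) * 12, per the formula above."""
    if net_annual_benefit <= 0:
        return float("inf")  # benefit never covers the cost
    return annual_f5_cost / net_annual_benefit * 12

print(round(payback_months(100_000, 73_600), 1))  # 16.3
```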
Why tiered rates? Not all freed GPU capacity can be immediately monetized. The realization rate depends on your operational maturity and demand profile:
- Emerging: freed capacity = growth runway
- Growing: some immediate use
- Established: can fill capacity fast
Your current stage: Emerging (20%)
Calculation: 333 GPUs freed → $10.22M gross × 20% = $2.04M realized
Incremental Token Revenue = (F5 Throughput − Base Throughput) × Model $/Token × GPU Count × Hours/Year
↳ Only the additional tokens/sec from F5 are counted, not your baseline revenue
F5 Benefit = Base Boost × Model Memory Factor × Workload Factor × GPU Factor
↳ Small models: tiny KV cache (0.25×) | Large models: massive KV cache (1.5×)
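A sketch of the incremental revenue formula above (it assumes throughput is measured in tokens/sec per GPU, hence the 3,600 sec/hr conversion; names and inputs are illustrative):

```python
def incremental_token_revenue(f5_tok_s: float, base_tok_s: float,
                              usd_per_token: float, gpu_count: int,
                              hours_per_year: int = 8760) -> float:
    """Revenue from only the *additional* tokens/sec that F5 provides."""
    extra_tok_s = f5_tok_s - base_tok_s
    return extra_tok_s * 3600 * hours_per_year * usd_per_token * gpu_count

# 450 -> 720 tok/s per GPU at $1 per 1M tokens (a 70B-class rate), single GPU
print(round(incremental_token_revenue(720, 450, 1e-6, 1), 2))  # 8514.72
```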
• Here (AI Factory): neocloud providers at 85% baseline → with F5: ~96%
• Dashboard tab: your specific deployment metrics and ROI
What This Tab Shows: This analysis models how F5 DPUs impact the unit economics of GPU cloud providers ("neoclouds") like IREN, CoreWeave, and Nebius. These companies rent GPU capacity and their profitability depends heavily on GPU utilization rates.
- GPU Utilization: 85% → 96%
- Revenue/MW-Year: +$1.33M additional
- Margin Improvement: +3.2 pts
- Gross Profit/MW: +$0.90M annually
- IREN: 35.8% → 39.0% margin
- Nebius: 38.1% → 41.3% margin
- CoreWeave: 30.6% → 33.8% margin
| Metric | IREN (Bare-Metal) | Nebius (Full-Stack) | CoreWeave (Full-Stack) |
|---|---|---|---|
Current: IREN's bare-metal model yields 35.8% margins with owned DC advantage ($0.47M DC depreciation vs $0.72-0.96M colo costs for competitors).
With F5: Adding F5 DPU offload improves utilization to 90%+, enabling IREN to capture an additional +4.3 pts of margin. Combined with their low power costs ($0.22M/MW), F5 helps IREN bridge the gap to Nebius-level margins without building a full software stack.
Potential: If IREN builds full-stack on top of F5-optimized infrastructure, theoretical margin reaches 48.6%.
Current: Nebius commands highest normalized margins (38.1%) through full-stack pricing premium (+20-25% revenue/GPU-hr) and diversified customer base (Cursor, Shopify, etc.).
With F5: F5 DPU amplifies their utilization advantage (whitepaper claims 100% benchmark performance). Moving from 95-97% to near-100% effective utilization captures additional +4.2 pts margin.
Moat: Customer base diversification + F5-enhanced platform performance creates sustainable competitive advantage vs colo-heavy competitors.
This analysis normalizes neocloud GPU pricing economics from public filings and industry research:
- GPU normalization: H100, 4-year depreciation ($3.50M/MW)
- Utilization baseline: 85% (industry standard for datacenter-scale infrastructure)
- Revenue/GPU-hr: $2.50-2.75 bare-metal, $2.80-3.50 full-stack (historical pricing)
- Infrastructure: 400 GPUs per MW with full networking, cooling, InfiniBand-class interconnect
- F5 DPU impact: +5.8% utilization boost (from CPU offload), networking efficiency gains, reduced middleware overhead
- Debt not included: CoreWeave's $1.3B/year interest would further pressure margins
Base Formula: Revenue/MW-Year = GPUs/MW × $/GPU-hr × Hours/Year × Utilization
H100 Baseline Example (IREN Bare-Metal):
| Input | Value |
|---|---|
| GPUs per MW | 400 (H100 @ 700W each) |
| $/GPU-hr (bare-metal) | $2.65 (market rate) |
| Hours/Year | 8,760 |
| Utilization | 85% |
| = Revenue/MW-Year | $7.89M ≈ $7.80M |
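The table's arithmetic can be reproduced directly (a sketch; names are illustrative):

```python
def revenue_per_mw_year(gpus_per_mw: int, usd_per_gpu_hr: float,
                        utilization: float, hours: int = 8760) -> float:
    """Revenue/MW-Year = GPUs/MW * $/GPU-hr * Hours/Year * Utilization."""
    return gpus_per_mw * usd_per_gpu_hr * hours * utilization

rev = revenue_per_mw_year(400, 2.65, 0.85)
print(f"${rev / 1e6:.2f}M")  # $7.89M
```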
GPU Performance Scaling:
Revenue scales with GPU throughput multiplier (faster GPUs command higher prices):
| GPU | Multiplier | GPUs/MW | IREN Revenue |
|---|---|---|---|
| H100 | 1.0ร | 400 | $7.80M |
| B100 | 1.4ร | 280 | $10.92M |
| B200 | 1.69ร | 280 | $13.18M |
| GB200 | 2.5ร | 280 | $19.50M |
F5 DPU Revenue Uplift:
Example: $7.80M × (85% + 5.8%) / 85% = $8.33M (+$0.53M uplift)
Note: Full-stack providers (Nebius, CoreWeave) command 15-25% revenue premiums over bare-metal due to managed services, ML frameworks, and customer support bundled into pricing.
Why Model Size Matters:
F5 DPU value scales with model size because larger models have bigger KV caches, higher memory pressure, and more CPU overhead to offload. Small models are compute-bound with minimal memory bottlenecks.
| Model Size | F5 Benefit | $/1M Tokens | Rev Factor | Expected ROI |
|---|---|---|---|---|
| 1B - 8B | 0.15× - 0.30× | $0.03 - $0.10 | 0.03× - 0.10× | Negative (-90%) |
| 13B - 32B | 0.45× - 0.70× | $0.15 - $0.40 | 0.15× - 0.40× | Negative to marginal |
| 70B - 72B | 1.0× | ~$1.00 | 1.0× (base) | 40-60% (baseline) |
| 175B | 1.20× | ~$5.00 | 5.0× | 200-300% |
| 405B | 1.35× | ~$12.00 | 12.0× | 350-400% |
| 671B | 1.50× | ~$18.00 | 18.0× | 500%+ |
Formulas applied:
• Utilization Boost = Base Boost × Model F5 Factor × [other factors]
• Revenue Value = Throughput Gain × Base $/Token × Model Revenue Factor
Source: Tolly Enterprises, LLC. Test Report #226104. F5 BIG-IP Next for Kubernetes (BNK) on DPU v2.2.0.
Two Value Streams:
Value = Servers × 10 cores × $/core-hr × 8,760 hrs × Workload multiplier + CPU power savings
1B: +406% throughput, 96% TTFT reduction, 80% latency reduction | 8B: +114%, 76%, 53% | 70B: +40%, 61%, 29%
Value = Additional tokens/sec × GPUs × 3,600 × 8,760 × $/token + TTFT retention
Report: tolly.com/publications/226104
What This Tab Shows: Sensitivity analysis reveals which input parameters have the greatest impact on your ROI, helping you understand where uncertainty matters most and where to focus negotiation or optimization efforts.
- Longer bars = higher sensitivity: Parameters with wide bars have outsized impact on ROI
- Red (left): ROI when parameter decreases by 50%
- Green (right): ROI when parameter increases by 50%
- Focus on top bars: These are your key risk/opportunity levers
- Heatmap: Shows ROI for all combinations of F5 cost vs GPU:DPU ratio
- Green zones: Profitable configurations to target
- Red zones: Configurations to avoid
- Elasticity >1: ROI changes faster than the parameter (high risk/reward)
| Parameter | Low Value | Base Value | High Value | ROI @ Low | ROI @ High | Elasticity | Risk Level |
|---|---|---|---|---|---|---|---|
| Run analysis to populate | |||||||
What This Tab Shows: Monte Carlo simulation runs thousands of scenarios with randomized inputs to show the full range of possible ROI outcomes. Instead of a single point estimate, you see the probability distribution of returns.
- Mean ROI: Average outcome across all simulations
- P10 (Downside): 90% of outcomes are better than this - your "worst realistic case"
- P50 (Median): Half of outcomes are above, half below - the "typical" result
- P90 (Upside): Only 10% of outcomes are better - your "best realistic case"
- P(ROI > 50%): Probability of achieving above-target returns
- VaR 95%: Maximum expected loss in 95% of scenarios
- CVaR (Expected Shortfall): Average loss in the worst 5% of scenarios
- Sharpe Ratio: Risk-adjusted return (higher = better risk/reward)
- Narrow histogram: Low uncertainty, predictable outcomes
- Wide histogram: High uncertainty, variable outcomes
What This Tab Shows: Scenario comparison lets you evaluate different deployment strategies side-by-side, from conservative to aggressive configurations. Use this to find the optimal balance of risk and return for your organization.
- Conservative: Lower risk, modest returns (sparse ratio, standard workloads)
- Balanced: Optimal risk/reward tradeoff (8:1 ratio, mixed workloads)
- Aggressive: Maximum ROI, higher execution risk (dense deployment, frontier models)
- Your Config: Current settings for direct comparison
- ROI: Annual return on investment (higher = better)
- NPV: Total value created over license term
- Payback: Time to recover investment (shorter = lower risk)
- Rating: Overall assessment (⭐ to ⭐⭐⭐⭐⭐)
| Scenario | GPU:DPU | F5 Cost | ROI | NPV | Payback | Rating |
|---|---|---|---|---|---|---|
| Click "Run Comparison" to analyze scenarios | ||||||
Scenario A (Current)
| Parameter | Value |
|---|---|
| GPU:DPU Ratio | 1:8 |
| F5 Cost | $10,000 |
| ROI | 101% |
| NPV | $2.1M |
Scenario B (Optimized)
| Parameter | Value |
|---|---|
| GPU:DPU Ratio | 1:16 |
| F5 Cost | $8,000 |
| ROI | 185% |
| NPV | $3.8M |
What This Tab Shows: Total Cost of Ownership (TCO) compares the full financial picture over 1-5 years: all costs (hardware, power, operations, licensing) versus all value generated (throughput, efficiency gains).
- GPU CapEx: Hardware investment for your GPU fleet
- Power & Cooling: Electricity costs at your utilization rate
- Operations: Staff, maintenance, software overhead
- Throughput Value: Revenue potential at baseline utilization
- Added DPU Cost: F5 hardware + licensing investment
- Power Change: Slight increase from DPUs, offset by efficiency
- OpEx Reduction: Lower orchestration and management overhead
- Enhanced Throughput: Higher revenue from improved utilization
What This Tab Shows: Year-by-year breakdown of all cash inflows and outflows from F5 DPU deployment, including discounted cash flows (DCF) for proper time-value-of-money analysis.
- Investment: F5 licensing cost (Year 0 = initial, subsequent = renewals)
- Throughput Value: Revenue from improved GPU utilization
- OpEx Savings: Reduced operational costs (management, orchestration)
- Power Delta: Net change in electricity costs
- Net Cash Flow: Total benefit minus total costs for each year
- Discounted CF: Today's value of future cash flows (at your discount rate)
- Cumulative: Running total - when this turns positive, you've broken even
- Year 0: Initial investment (negative cash flow)
- Years 1+: Should show positive net cash flow if ROI > 0
| Year | DPU Hardware | Incr. Revenue | OpEx Savings | Power Delta | F5 License | Net Cash Flow | Discounted CF | Cumulative |
|---|---|---|---|---|---|---|---|---|
What This Tab Shows: Break-even analysis identifies the critical thresholds for each parameter - the maximum cost you can pay or minimum scale you need for F5 DPUs to remain profitable.
- Max F5 Cost: Highest $/DPU license that keeps ROI ≥ 0%
- Min GPU Count: Smallest deployment that remains profitable
- Max Electricity Rate: Highest power cost before margins go negative
- Bar indicator: Green = safe margin, Yellow = close to threshold, Red = over limit
- Target ROI column: Desired return hurdles (25%, 50%, 100%)
- Required values: What each parameter must be to hit that target
- Use for negotiation: "We need F5 cost ≤ $X to hit our 50% ROI hurdle"
- Feasibility check: If required values are unrealistic, adjust expectations
| Target ROI | Required F5 Cost ≤ | Or DPU:GPU Ratio ≥ | Or GPU Count ≥ | Achievable? |
|---|---|---|---|---|
| Click "Calculate Break-Even" to analyze | ||||
NVIDIA BlueField-4 powers the Inference Context Memory Storage Platform, a new class of AI-native storage infrastructure for gigascale inference. Configure storage synergy to include infrastructure costs in TCO calculations and unlock additional F5 DPU performance benefits.
Quick Presets
Storage Vendor
Interconnect
Storage Infrastructure Cost (CapEx)
Model Overview
This calculator models F5 DPU ROI using a multi-factor approach that accounts for model size, workload complexity, GPU:DPU ratio, and infrastructure configuration. The model has been calibrated against industry benchmarks and real-world neocloud economics (IREN, CoreWeave, Nebius).
Core ROI Formula
Tolly Report #226104: Separated Value Streams (March 2026)
F5 BNK on DPU uses ~2 CPU cores vs HAProxy's ~12 cores, an 83% reduction. This frees ~10 cores per server for AI application processing.
GPU-aware traffic steering avoids sending requests to busy GPUs. Tolly tested three models vs HAProxy:
| Metric | 1B | 8B | 70B |
|---|---|---|---|
| Throughput | +406% | +114% | +40% |
| TTFT Reduction | 96% | 76% | 61% |
| Latency Reduction | 80% | 53% | 29% |
Source: Tolly #226104, F5 BNK on DPU v2.2.0, NVIDIA GH200, AIPerf, Feb 2026.
CPU Offload Breakdown
In traditional AI inference architectures, CPUs handle significant overhead tasks that prevent GPUs from reaching full utilization. F5 DPUs offload these functions to dedicated hardware, freeing CPU cycles and enabling higher GPU throughput.
Larger models have proportionally more CPU-bound operations, especially KV cache management which grows with model parameters and context length.
Total_Overhead = Base_Overhead × Model_Scale_Factor, where scale factors range from 0.7× (7B) to 1.3× (405B)
GPU Cloud Economics Methodology
The GPU Cloud Economics section is designed specifically for Neoclouds, Hyperscalers, and GPU Cloud Providers who measure success in terms of revenue per megawatt, customer density, and infrastructure efficiency. Unlike traditional enterprise ROI (which focuses on cost savings), cloud economics focuses on revenue maximization from constrained resources: power, space, and capital.
In GPU cloud infrastructure, power is the fundamental constraint. A data center with 100MW of power capacity can't add more GPUs without building new infrastructure. F5 DPUs help providers extract more revenue from the same power envelope by:
- Increasing GPU utilization → more billable compute hours from the same hardware
- Improving throughput → serve more customers/requests with the same GPU count
- Reducing CPU bottlenecks → GPUs spend more time on inference, less time waiting
- Enabling higher customer density → more tenants per rack without performance degradation
What it measures: Annual Recurring Revenue generated per megawatt of power consumed. This is the fundamental efficiency metric for GPU cloud providers.
Industry benchmark: $10M/MW (average), $15-20M/MW (top performers like OpenAI)
What it measures: The total additional revenue your specific cluster can generate with F5 optimization, calculated as the difference between F5-enhanced and baseline revenue.
Key insight: This is your actual dollar uplift, not a percentage improvement
What it measures: The percentage improvement in revenue efficiency: how much more revenue you generate per unit of power with F5 vs. without.
Typical range: 40-80% density lift depending on model size and workload
What it measures: How many more concurrent customers/tenants you can serve on the same infrastructure due to improved throughput.
Business impact: More customers without new capital expenditure
GPU Requirements Matrix
Calculate exact GPU requirements based on model size, precision, and context window.
Memory formula: Params × Bytes_per_Param + KV_Cache_Overhead
| Model | Params | Model Mem | KV Cache | Total | GPUs Needed | Status |
|---|---|---|---|---|---|---|
Model_Memory = Parameters × Bytes_per_Precision × 1.1 (activations overhead)
KV_Cache = 2 × Layers × Hidden_Dim × Context × Batch × Bytes × 1.2 (fragmentation)
GPUs = ceil(Total_Memory ÷ (GPU_Memory × 0.85 utilization factor))
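A sketch of these memory formulas; the 70B layer and hidden-dimension values below are Llama-70B-like assumptions chosen for illustration:

```python
import math

def gpus_needed(params_billions: float, bytes_per_param: int, layers: int,
                hidden_dim: int, context: int, batch: int, kv_bytes: int,
                gpu_mem_gb: float, util: float = 0.85) -> int:
    """GPU count from model memory + KV cache, per the formulas above."""
    model_mem_gb = params_billions * bytes_per_param * 1.1   # +10% activations
    kv_cache_gb = 2 * layers * hidden_dim * context * batch * kv_bytes * 1.2 / 1e9
    total_gb = model_mem_gb + kv_cache_gb
    return math.ceil(total_gb / (gpu_mem_gb * util))         # 85% usable memory

# 70B model, FP16 (2 bytes), 80 layers, hidden 8192, 32K context, batch 1, H100-80GB
print(gpus_needed(70, 2, 80, 8192, 32768, 1, 2, 80))  # 4
```

With these assumed dimensions the result matches the "70B = 4× H100" sizing used elsewhere in the document.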
Deployment Scale Matrix
Estimate total GPUs needed based on deployment tier and target concurrent users. Assumes 70B model, H100 GPUs, ~50 tok/s per replica, ~500 tokens per request.
• Replicas = Concurrent_Users × Avg_Request_Duration ÷ Target_Latency (assumes ~10s request, <2s latency target)
• GPUs per replica: 8B=1, 70B=4, 405B=16 (H100, FP16, 32K context)
• Throughput: ~50 tok/s per replica at 70B; scales inversely with model size
• Add 20-30% for redundancy/failover in production
- Throughput per replica: A 70B model on 4× H100s can handle ~12 requests/min
- Replicas needed: 100 concurrent users ÷ 12 req/min = ~9 replicas
- GPUs needed: 9 replicas × 4 GPUs/replica = 36 GPUs base
- +25% HA: Add redundancy for failover = 45 GPUs total
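The worked example above as a sketch (the ~12 req/min and 4 GPUs/replica figures come from the example; the function name is illustrative):

```python
import math

def deployment_gpus(concurrent_users: int, req_per_min_per_replica: float = 12,
                    gpus_per_replica: int = 4, ha_overhead: float = 0.25) -> int:
    """Replicas from sustained request rate, plus HA/failover headroom."""
    replicas = math.ceil(concurrent_users / req_per_min_per_replica)
    base_gpus = replicas * gpus_per_replica
    return math.ceil(base_gpus * (1 + ha_overhead))

print(deployment_gpus(100))  # 45
```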
F5 Utilization Boost Calculation
| Parameter | Your Value | Description |
|---|---|---|
| Base | 5.0% | Base utilization boost at default settings (min floor: 2.5%) |
| KV_Cache | 1.00 | KV cache efficiency / 65 (yours: 65%) |
| Workload | 0.85 | Workload complexity (basic=0.85, realtime=0.95, moe=1.2) |
| Disagg | 1.08 | Disaggregated serving bonus (1.08× if enabled) |
| Ratio | 1.00 | GPU:DPU ratio (4:1=1.16, 8:1=1.0, 16:1=0.68, 32:1=0.04) |
| GPU_Gen | 1.00 | GPU generation (V100=0.48, H100=1.0, GB300=2.52) |
| Model | 1.00 | Model size factor (7B=0.25, 70B=1.0, 405B=1.35) |
| Result | +5.0% | Utilization boost (85% → 89%) |
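Treating the table as a straight product of factors with a minimum floor gives the sketch below. Note that with these defaults the product is ~4.6%, slightly below the displayed +5.0%, so the calculator may round or weight factors differently; this is an assumption, not the tool's exact code:

```python
def f5_utilization_boost(base=5.0, kv_cache=1.0, workload=0.85, disagg=1.08,
                         ratio=1.0, gpu_gen=1.0, model=1.0, floor=2.5):
    """Multiply the factors from the table above and enforce the minimum floor."""
    boost_pct = base * kv_cache * workload * disagg * ratio * gpu_gen * model
    return max(boost_pct, floor)

print(round(f5_utilization_boost(), 2))  # 4.59
```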
NVIDIA Inference Context Memory Storage Platform (ICMS)
NVIDIA BlueField-4 DPU Specifications
ICMS Verified Performance Metrics
G3.5 Memory Tier sits between GPU HBM (G1) and general storage (G4), enabling intelligent context caching.
Capped at a maximum 1.35× multiplier
Confirmed NVIDIA ICMS Partners (CES 2026)
| Component | Impact Range | Description |
|---|---|---|
| KV-Cache Offload | +25% | Intelligent offloading of KV cache to BlueField-4 NVMe storage (G3.5 tier) |
| CXL Memory Extension | +20% | CXL 3.0/4.0 memory pooling extends effective GPU memory up to 16TB/GPU |
| GPUDirect RDMA | +8% | Direct GPU-to-storage via ConnectX-9 (800 Gbps) bypasses CPU overhead |
| ICMS Partner Bonus | +2% to +5% | BlueField-4 certified vendors with optimized DOCA integration |
| Vendor Synergy | +3% to +22% | Storage vendor-specific F5 integration (WEKA NeuralMesh, DDN Infinia, etc.) |
| TTFT Reduction | 12% to 30% | Time-to-first-token improvement varies by vendor optimization level |
| Interconnect Bonus | +0.5% to +5% | NVLink 6, CXL 4.0, InfiniBand XDR/GDR provide additional gains |
Key ICMS Technology Components
- NVIDIA Dynamo: Open-source disaggregated inference serving (WEKA, IBM, Dell integration)
- NIXL: NVIDIA Inference Xfer Library for optimized storage-to-GPU data movement
- DOCA Framework: BlueField-4 software stack for storage offload acceleration
- NeuralMesh (WEKA): Augmented Memory Grid providing transparent context caching
- NFS-over-RDMA: High-performance NFS with RDMA transport (Dell PowerScale)
- AI OS Native (VAST): Storage OS running directly on BlueField-4 DPUs
Enable in Admin Configuration → Storage tab. Select a storage vendor and interconnect technology to model the combined F5 + Context-Aware Storage impact on ROI calculations. Speculative Vera Rubin-era storage systems are marked with estimated 2027+ availability.
Incremental Token Revenue Calculation
Key insight: Token pricing varies dramatically by model size. Small models (7B) charge ~$0.07/1M tokens while frontier models (405B) charge ~$12/1M tokens. This 170× pricing difference is the primary driver of ROI variation.
Training Value Calculation
When "Training" or "Mixed" use case is selected, the calculator computes value from three components:
| Component | Formula | Basis |
|---|---|---|
| Training Efficiency Gains | GPU_Hours × GPU_Rate × (Efficiency% × Model_Factor) | Faster data loading (12-18%); gradient sync improvement (5-10%); I/O optimization (5-8%). Default: 22% (configurable 5-40%) |
| Data Pipeline Acceleration | GPU_Count × $500/year × Model_Factor | Network/storage offload benefits; reduced data loading bottlenecks. Estimate: ~$500/GPU/year |
| Checkpoint Optimization | GPU_Count × $300/year × Model_Factor | Faster checkpoint save/restore; reduced training interruption time. Estimate: ~$300/GPU/year |
- The $500/GPU/year (data pipeline) and $300/GPU/year (checkpointing) are estimates based on general DPU benefits
- These values are NOT sourced from specific F5 benchmarks
- Actual benefits will vary significantly based on workload characteristics, storage architecture, and network topology
- The Training Efficiency % is configurable (5-40%); adjust based on your measured results
Mixed Mode (50/50): When "Mixed" use case is selected, the calculator blends inference value and training value equally:
Disaggregated Serving Bonus
| Boost Location | Multiplier | Applied To |
|---|---|---|
| Utilization Calculation | 1.12× | F5 utilization boost (f5Boost) |
| Neocloud Impact | 1.08× | Utilization projections for neocloud economics |
| CPU Overhead | +5% | Additional coordination overhead (which F5 then recovers) |
OpEx Savings Calculation
| Component | Default | Range | Description |
|---|---|---|---|
| Base OpEx ($/GPU/year) | $1,000 | $100 - $5,000 | Annual operational cost per GPU: management & orchestration overhead, network operations, monitoring & observability, incident response |
| F5 Reduction % | 15% | 5% - 40% | Percentage of OpEx reduced by F5 DPU: simplified orchestration, reduced network complexity, automated traffic management, lower debugging overhead |
| Workload Factor | 0.9 - 1.5× | By workload type | Complex workloads (MoE, Multi-Agent) have higher orchestration overhead → more savings potential |
| Model Factor | 0.22 - 2.75× | By model size | Larger models require more complex orchestration → more savings from F5 simplification |
| Ratio Efficiency | 0 - 1.0× | min(1.0, 8/ratio) | Diminishing returns above 8:1 GPU:DPU ratio. Denser deployments get more DPU leverage. |
- The default $1,000/GPU/year Base OpEx is an estimate
- The default 15% reduction rate is an estimate
- These values are NOT sourced from specific F5 benchmarks
- Actual OpEx varies significantly by organization, team size, and operational maturity
- Both values are configurable in the sidebar; adjust based on your actual figures
Example Calculation (1000 GPUs, 70B model, Multi-Agent workload, 8:1 ratio):
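Under the components above, the example can be sketched as follows. The Multi-Agent workload factor of 1.5× and the 70B model factor of 1.0× are assumptions picked from the stated ranges, not published calculator values:

```python
def opex_savings(gpu_count: int, base_opex: float = 1000, reduction: float = 0.15,
                 workload_factor: float = 1.0, model_factor: float = 1.0,
                 gpu_dpu_ratio: int = 8) -> float:
    """Annual OpEx savings per the component table above."""
    ratio_efficiency = min(1.0, 8 / gpu_dpu_ratio)   # diminishing returns past 8:1
    return (gpu_count * base_opex * reduction *
            workload_factor * model_factor * ratio_efficiency)

# 1,000 GPUs, 70B model (1.0x assumed), Multi-Agent (1.5x assumed), 8:1 ratio
print(round(opex_savings(1000, workload_factor=1.5)))  # 225000
```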
Base Utilization by Neocloud Maturity Stage
| Stage | Typical Util | Characteristics | F5 DPU Value Focus |
|---|---|---|---|
| Emerging | 40-60% | Bursty customer demand; manual/basic orchestration; inconsistent batch scheduling; GPU clusters idle between jobs | Transformation story: "Reach 75%+ utilization faster". Massive capacity unlock; highest ROI potential |
| Growing | 60-75% | Stabilizing customer base; improving orchestration (Kubernetes, Slurm); some workload diversity; still significant headroom | Acceleration story: "Path to 85%+ efficiency". Strong capacity gains; very good ROI |
| Established | 80-90% | Mature operations; sophisticated scheduling; optimized batching; limited headroom for capacity gains | Optimization story: focus on latency (11× TTFB improvement), throughput (+20-30% tokens/sec), operational simplification |
Same F5 investment, different value story: An emerging neocloud at 42% utilization might see 200%+ ROI from capacity recovery, while an established neocloud at 85% might see 50% ROI, but its value comes from latency, throughput, and operational benefits rather than capacity. Both are valid use cases. The calculator adapts to show the appropriate value proposition for each maturity stage.
GPU Capacity Freed Calculation
Why divide by Base Utilization? This answers: "How many additional GPUs would I need WITHOUT F5 to achieve the same output?" For example, if F5 improves utilization from 85% → 92% on 256 GPUs, that's equivalent to having 21 additional GPUs at the original 85% utilization (256 × 7% ÷ 85% = 21).
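The freed-capacity formula as a sketch:

```python
def gpus_freed(gpu_count: int, base_util: float, f5_util: float) -> float:
    """Equivalent extra GPUs you'd need without F5 to match F5's output."""
    return gpu_count * (f5_util - base_util) / base_util

print(round(gpus_freed(256, 0.85, 0.92)))  # 21
```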
Critical Assumption: Not all freed GPU capacity translates directly to revenue. The realization rate accounts for demand constraints, ramp-up time, and market conditions. We tier this by neocloud maturity:
| Maturity Stage | Realization Rate | Rationale |
|---|---|---|
| Emerging (40-60% util) | 20% | Already underutilized due to low demand. Freed capacity = growth runway, not immediate revenue. Customer acquisition takes time. |
| Growing (60-75% util) | 35% | Stabilizing demand with a growing customer base. Mix of immediate monetization and growth runway. |
| Established (80-90% util) | 60% | High demand, often capacity-constrained. Waiting lists, premium pricing. Can immediately monetize freed capacity. |
• 333 GPUs freed × 8,760 hrs × $3.50/hr = $10.22M gross
• At 20% realization: $10.22M × 20% = $2.04M realized value (included in ROI)
| GPU | $/hr | GPU | $/hr | GPU | $/hr |
|---|---|---|---|---|---|
| V100-16 | $1.20 | A100-80 | $2.80 | H200 | $4.50 |
| V100-32 | $1.50 | H100 | $3.50 | B100 | $5.50 |
| A100-40 | $2.20 | H100-NVL | $3.20 | B200/GB200 | $6.50-$8.00 |
Rates based on CoreWeave, Lambda Labs, and major cloud provider pricing as of December 2025. On-demand rates; reserved/committed pricing typically 30-50% lower.
- GPUs Freed: The equivalent number of additional GPUs you would need to purchase WITHOUT F5 to achieve the same total output. This represents the capacity gain from improved utilization.
- GPU-Hours/Year: Total compute time freed annually. Use this for capacity planning and workload scheduling.
- Capacity Value: The economic value of the freed capacity, priced at market GPU rental rates. This represents potential additional revenue or avoided GPU procurement costs.
The "GPUs Freed" metric exhibits diminishing returns as base utilization increases. This is mathematically correct and economically meaningful.
| Base Util | F5 Util | Boost | GPUs Freed (256 GPUs) | F5 Value Focus |
|---|---|---|---|---|
| 75% | 85% | +10 pts | 34.1 GPUs | Capacity recovery (high waste to capture) |
| 85% | 92% | +7 pts | 21.1 GPUs | Balanced (capacity + performance) |
| 90% | 95% | +5 pts | 14.2 GPUs | Latency & throughput improvements |
| 95% | 98% | +3 pts | 8.1 GPUs | Operational simplification & latency |
- Low utilization (60-80%): Lead with the capacity recovery story; F5 "unlocks" significant GPU equivalents
- Medium utilization (80-90%): Balanced pitch; capacity gains plus performance improvements
- High utilization (90%+): Lead with latency (11× TTFB), throughput (+20-30%), and operational benefits; capacity gains are secondary
Sourced Benchmarks: F5 BIG-IP + NVIDIA BlueField
All parameters are derived from published benchmarks, vendor testing, and analyst reports. Click any source link for detailed methodology.
| Metric | Aggressive | Standard | Conservative | Primary Source |
|---|---|---|---|---|
| CPU Offload | 99% | 70% | 30% | F5/SoftBank PoC (July 2025); Red Hat BF-2; VMware vSphere 8 |
| GPU Util Improvement | +50% | +30% | +15% | PIPO research (2025): 40% → 90% GPU utilization |
| TTFB Reduction | 91% (11×) | 60% | 30% | F5/SoftBank PoC: 11× TTFB improvement on H100 cluster |
| Token Throughput | +30% | +20% | +10% | F5 NCP Architecture Blog (Oct 2025) |
| TCO Reduction | 30% | 17.8% | 10% | NVIDIA 10K server study (Nov 2022): $148M → $121.7M |
| Power Reduction | 34% | 24% | 15% | NVIDIA VMware (34%); Ericsson 5G UPF (24%); NREL (15%) |
| Networking Savings | 40% | 30% | 15% | F5 infrastructure consolidation; Red Hat BF-2 testing |
| Util Boost Base | 7.0% | 5.0% | 3.0% | Calibrated from CPU offload + throughput research |
| Minimum Floor | 4.0% | 2.5% | 1.5% | Guaranteed minimum benefit from DPU offload |
- F5/SoftBank PoC (July 2025): BIG-IP + BlueField-3 on H100 cluster - 99% CPU offload, 11× TTFB, 190× energy efficiency
- NVIDIA DPU Power Efficiency (Nov 2022): 10K server study - $26.6M savings (17.8% TCO reduction)
- MangoBoost MLPerf v5.0 (Apr 2025): 103K tokens/sec on 32× MI300X with DPU acceleration
- Red Hat BlueField-2: 70% CPU reduction, IPsec at 100 Gbps line rate
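For modeling, the three benchmark tiers above can be captured as a parameter table. A hedged sketch with illustrative key names (percentages expressed as fractions; the calculator's internal names may differ):

```python
# Benchmark-derived parameter tiers transcribed from the table above.
SCENARIOS = {
    "aggressive":   {"cpu_offload": 0.99, "gpu_util_improvement": 0.50,
                     "ttfb_reduction": 0.91, "token_throughput": 0.30,
                     "tco_reduction": 0.30, "power_reduction": 0.34},
    "standard":     {"cpu_offload": 0.70, "gpu_util_improvement": 0.30,
                     "ttfb_reduction": 0.60, "token_throughput": 0.20,
                     "tco_reduction": 0.178, "power_reduction": 0.24},
    "conservative": {"cpu_offload": 0.30, "gpu_util_improvement": 0.15,
                     "ttfb_reduction": 0.30, "token_throughput": 0.10,
                     "tco_reduction": 0.10, "power_reduction": 0.15},
}

def params(scenario: str) -> dict:
    """Look up the parameter set for a named scenario tier."""
    return SCENARIOS[scenario]

print(params("standard")["token_throughput"])  # 0.2
```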
Model Size Parameters
| Model | Tok/sec | Aggressive F5× | Aggressive ROI* | Standard F5× | Standard ROI* | Conservative F5× | Conservative ROI* | $/1M |
|---|---|---|---|---|---|---|---|---|
* ROI calculated with current settings: 256 GPUs, 8:1 ratio, realtime workload, disaggregated architecture
Scale Factor: 1.00× (economies of scale applied at larger deployments)
| GPU Count | Scale Factor | Rationale |
|---|---|---|
| <128 | 0.95× - 1.00× | Small: Overhead inefficiency |
| 128 - 256 | 1.00× | Entry enterprise: Baseline |
| 256 - 512 | 1.00× - 1.08× | Mid-size: Modest efficiency |
| 512 - 1,024 | 1.08× - 1.18× | Large: Operational leverage |
| 1,024 - 2,048 | 1.18× - 1.30× | Enterprise: Volume discounts |
| 2,048 - 4,096 | 1.30× - 1.45× | Hyperscale: Significant leverage |
| >4,096 | 1.45× - 1.60× | Mega-scale: Max benefits (capped) |
Scale factors reflect operational leverage, F5 volume licensing, and infrastructure efficiency gains at larger deployments.
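If the calculator interpolates within the bands above, a piecewise-linear lookup reproduces the band edges. Interpolation between edges is an assumption here; the tool may instead use step functions:

```python
# Band edges transcribed from the scale-factor table; linear interpolation
# between edges is an assumption, not documented calculator behavior.
BREAKPOINTS = [(128, 1.00), (256, 1.00), (512, 1.08),
               (1024, 1.18), (2048, 1.30), (4096, 1.45)]

def scale_factor(gpu_count: int) -> float:
    """Scale factor applied to F5 benefits at a given deployment size."""
    if gpu_count < 128:                      # small: 0.95x - 1.00x band
        return 0.95 + 0.05 * gpu_count / 128
    prev_n, prev_f = BREAKPOINTS[0]
    for n, f in BREAKPOINTS[1:]:
        if gpu_count <= n:
            frac = (gpu_count - prev_n) / (n - prev_n)
            return prev_f + frac * (f - prev_f)
        prev_n, prev_f = n, f
    # mega-scale: approach the 1.60x cap
    return min(1.60, 1.45 + 0.15 * (gpu_count - 4096) / 4096)

print(round(scale_factor(256), 2))   # 1.0
print(round(scale_factor(1024), 2))  # 1.18
```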
Aggressive:
- Peak performance envelope
- Expert-tuned infrastructure
- Maximum batch concurrency
- Long context + high KV pressure
- For opportunity sizing

Standard:
- Upper performance envelope
- Well-optimized deployments
- High batch concurrency
- Production workloads
- For planning & budgeting

Conservative:
- Baseline F5 benefits
- Initial deployment estimates
- Risk-averse planning
- Early-stage / PoC
- For CFO approval
NVIDIA Data Center GPU Specifications (Dec 2025)
| GPU | VRAM (GB) | Bandwidth (TB/s) | TDP (W) | Architecture | 1× GPU | 2× GPU | 4× GPU | 8× GPU |
|---|---|---|---|---|---|---|---|---|
| V100-16 | 16 | 0.9 | 300 | Volta | 7B | 13B | 32B | 70B |
| V100-32 | 32 | 0.9 | 300 | Volta | 13B | 32B | 70B | 175B |
| A100-40 | 40 | 1.6 | 400 | Ampere | 13B | 32B | 70B | 175B |
| A100-80 | 80 | 2.0 | 400 | Ampere | 32B | 70B | 175B | 405B |
| H100 | 80 | 3.35 | 700 | Hopper | 32B | 70B | 175B | 405B |
| H200 | 141 | 4.8 | 700 | Hopper | 70B | 175B | 405B | 671B |
| B100 | 192 | 8.0 | 700 | Blackwell | 70B | 175B | 405B | 671B |
| B200 | 192 | 8.0 | 1000 | Blackwell | 70B | 175B | 405B | 671B |
| GB200 | 384 | 16.0 | 1000 | Grace-Blackwell | 175B | 405B | 671B | 671B×2 |
| GB300 | 288 | 16.0 | 1200 | Grace-Blackwell | 175B | 405B | 671B | 671B×2 |
| R200 | 288 | 22.0 | 1800 | Rubin (2H 2026) | 175B | 405B | 671B | 1T+ |
| R200-Ultra | 1024 | 44.0 | 2200 | Rubin-Ultra (2027) | 405B | 671B | 1T+ | 2T+ |
Max Model Capacity: FP16 weights + KV cache at 65% utilization. Multi-GPU requires NVLink/NVSwitch for tensor parallelism. Entries of 671B and above are at DeepSeek/Llama 4 scale.
NVIDIA Vera Rubin (R200) - Power & Thermal Considerations
- TDP: 1,800W (2.6× H100)
- Memory: 288GB HBM4 @ 22 TB/s
- Compute: 50 PFLOPS FP4 inference
- Cost: ~$200K+ estimated
- Availability: 2H 2026
- Liquid cooling mandatory
- NVL72 rack: ~120-130kW total
- 8× perf/watt vs Blackwell (inference)
- 10× lower cost per token vs Blackwell
- R200-Ultra (2027): 2,200W, 1TB HBM4e
Workload Classification
Standard
- Basic Queries
- Batch Inference
- Real-Time Inference
Lower CPU overhead (20-35%)
Advanced
- RAG Pipeline
- Test-Time Compute
- MoE Models
- Synthetic Data Gen
Higher orchestration overhead (35-55%)
Agentic
- AI Agents
- Multi-Agent Systems
- Deep Reasoning (o1-style)
Highest F5 benefit (1.25-1.45×)
Workload × Model Impact Matrix
Expected F5 DPU ROI by workload type and model size. F5 benefit scales with orchestration complexity.
| Workload | CPU Overhead | F5 Benefit | 7B | 8B | 13B | 32B | 70B | 175B | 405B | 671B |
|---|---|---|---|---|---|---|---|---|---|---|
| Basic Queries | 25% | 1.0× | -85% | -75% | -60% | -45% | -35% | -15% | +5% | +15% |
| Real-time Serving | 30% | 1.1× | -70% | -60% | -45% | -25% | -5% | +20% | +45% | +65% |
| RAG Pipeline | 45% | 1.3× | -55% | -45% | -25% | 0% | +25% | +55% | +90% | +120% |
| Synthetic Data Gen | 35% | 1.15× | -65% | -55% | -35% | -15% | +10% | +35% | +65% | +90% |
| AI Agents | 50% | 1.35× | -50% | -40% | -15% | +10% | +35% | +70% | +110% | +145% |
| Multi-Agent | 50% | 1.4× | -45% | -35% | -10% | +15% | +45% | +85% | +130% | +170% |
| MoE Inference | 55% | 1.5× | -35% | -25% | 0% | +30% | +67% | +115% | +165% | +210% |
| Test-Time Compute | 55% | 1.45× | -40% | -30% | -5% | +25% | +60% | +105% | +155% | +195% |
Legend: 7B-8B = negative ROI (F5 not recommended) | 13B-32B = marginal (workload dependent) | 70B+ = positive ROI (F5 sweet spot)
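The matrix can be queried programmatically; a sketch with a subset of rows transcribed from the table above (values in percent):

```python
# ROI by workload and model size, transcribed from the impact matrix.
# A subset of rows; extend with the remaining rows as needed.
SIZES = [7, 8, 13, 32, 70, 175, 405, 671]           # model size, billions
ROI = {
    "Basic Queries":     [-85, -75, -60, -45, -35, -15,   5,  15],
    "Real-time Serving": [-70, -60, -45, -25,  -5,  20,  45,  65],
    "RAG Pipeline":      [-55, -45, -25,   0,  25,  55,  90, 120],
    "AI Agents":         [-50, -40, -15,  10,  35,  70, 110, 145],
    "MoE Inference":     [-35, -25,   0,  30,  67, 115, 165, 210],
}

def expected_roi(workload: str, model_b: int) -> int:
    """Expected F5 ROI (%) for a workload at a given model size."""
    return ROI[workload][SIZES.index(model_b)]

print(expected_roi("RAG Pipeline", 70))   # 25
print(expected_roi("AI Agents", 405))     # 110
```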
Frontier Models (MoE Architectures 2025-2027)
Ultra-scale models use Mixture-of-Experts (MoE) architectures where only a subset of parameters (typically 5-30B) are activated per inference, dramatically reducing compute requirements while maintaining full model capability.
Key Insight: MoE architectures deliver frontier capability with dramatically lower inference cost. F5 DPU benefits increase with model scale due to more complex expert routing, KV cache management, and orchestration overhead. Models at 1T+ scale represent the highest F5 ROI opportunity.
GPU:DPU Ratio Impact
| Ratio | Util Boost | ROI (70B) | Notes |
|---|---|---|---|
| 4:1 | +27% | -28% | Over-provisioned (too many DPUs) |
| 8:1 | +27% | +45% | Optimal balance (recommended) |
| 16:1 | +19% | +97% | Higher ROI, reduced boost |
| 32:1 | +11% | +121% | Sparse - diminishing returns |
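DPU count follows directly from the chosen ratio, and the 50 W/DPU figure from the Key Assumptions below gives the added power draw. Ceiling division is an assumption here, since partial DPUs cannot be deployed:

```python
def dpu_provisioning(gpu_count: int, ratio: int, dpu_watts: float = 50.0):
    """DPU count and added power (kW) for a given GPU:DPU ratio.

    50 W per DPU comes from the Key Assumptions; ceiling division is an
    assumption (you cannot deploy a fraction of a DPU).
    """
    dpus = -(-gpu_count // ratio)          # ceiling division
    return dpus, dpus * dpu_watts / 1000   # (count, added kW)

print(dpu_provisioning(256, 8))    # (32, 1.6)
print(dpu_provisioning(1000, 8))   # (125, 6.25)
```

The second call matches the cash-flow example's 1000 GPUs at 8:1 yielding 125 DPUs.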
Key Assumptions
- H100 baseline: 700W, $30-40K/GPU
- PUE: 1.4 (industry standard)
- DPU power: 50W each
- Electricity: $0.12/kWh default
- Base $/token: $0.0000005
- Discount rate: 12% default
- License term: 3 years default
- Utilization floor: 85%
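These assumptions plug into the power formulas from the methodology section: a 25% idle floor plus a utilization-scaled 75% of TDP, DPUs at 50 W each. Applying PUE to the IT load for facility power is an assumption in this sketch:

```python
def baseline_power_kw(gpus: int, gpu_watts: float, util: float) -> float:
    """IT power draw: 25% idle floor plus utilization-scaled 75% of TDP."""
    return gpus * gpu_watts * (0.25 + 0.75 * util) / 1000

def f5_power_kw(gpus: int, gpu_watts: float, util: float,
                dpus: int, dpu_watts: float = 50.0) -> float:
    """Baseline IT power plus the DPU fleet's draw."""
    return baseline_power_kw(gpus, gpu_watts, util) + dpus * dpu_watts / 1000

# 256 H100s (700 W) at the 85% utilization floor, 32 DPUs (8:1 ratio):
it_kw = baseline_power_kw(256, 700, 0.85)
print(round(it_kw, 1))                             # 159.0
print(round(f5_power_kw(256, 700, 0.85, 32), 1))   # 160.6
print(round(it_kw * 1.4, 1))                       # 222.7 facility kW at PUE 1.4
```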
DPU Hardware Cost Model
| Model | Description | Year 0 Impact | Use Case |
|---|---|---|---|
| Bundled | DPU hardware included in GPU infrastructure package | $0 additional | New cluster deployments with integrated DPU, OEM partnerships |
| Add-on | DPU hardware purchased separately | DPUs × $3,800/unit | Retrofitting existing clusters, standalone DPU procurement |
License Payment Model
| Model | Description | Year 0 | Years 1-N |
|---|---|---|---|
| CapEx (Upfront) | Full multi-year license paid upfront | Annual × Term | $0 |
| Subscription | Annual payments at start of each period | Year 1 sub | Next year's sub* |
* Subscription payments are made at the start of each period (pay for Year 2 during Year 1, etc.). Last year has no payment.
Cash Flow Examples (3-Year Term)
Example: 1000 GPUs, 8:1 ratio (125 DPUs), $10,000/year annual F5 license
CapEx (upfront license, bundled DPU hardware):
| Year | Investment | F5 License | Benefits |
|---|---|---|---|
| Year 0 | -$30,000 | -$30,000 | $0 |
| Year 1 | $0 | $0 | +Benefits |
| Year 2 | $0 | $0 | +Benefits |
| Year 3 | $0 | $0 | +Benefits |
Subscription license (bundled DPU hardware):
| Year | Investment | F5 License | Benefits |
|---|---|---|---|
| Year 0 | -$10,000 | -$10,000 | $0 |
| Year 1 | $0 | -$10,000 | +Benefits |
| Year 2 | $0 | -$10,000 | +Benefits |
| Year 3 | $0 | $0 | +Benefits |
Add-on DPU hardware + subscription license:
| Year | DPU HW | F5 License | Total Outflow | Benefits | Net Cash Flow |
|---|---|---|---|---|---|
| Year 0 | -$475,000 | -$10,000 | -$485,000 | $0 | -$485,000 |
| Year 1 | $0 | -$10,000 | -$10,000 | +Benefits | Benefits - $10K |
| Year 2 | $0 | -$10,000 | -$10,000 | +Benefits | Benefits - $10K |
| Year 3 | $0 | $0 | $0 | +Benefits | Benefits only |
- ROI Calculation: Always uses annualized F5 cost regardless of payment model (for comparability)
- Cash Flow Tab: Shows actual payment timing based on selected model
- NPV/IRR: Currently uses simplified annual net benefit (future enhancement: payment-timing-aware NPV)
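The two payment models can be expressed as cash-flow schedules; a sketch matching the subscription timing note above (pay at the start of each period, no payment in the final year):

```python
def license_schedule(annual: float, term: int, model: str) -> list[float]:
    """F5 license outflows for Year 0 through Year `term` (negative = payment)."""
    if model == "capex":
        # Full multi-year license paid upfront
        return [-annual * term] + [0.0] * term
    # Subscription: Year 0 pays for Year 1, ..., Year term-1 pays for Year term;
    # the final year carries no payment.
    return [-annual] * term + [0.0]

print(license_schedule(10_000.0, 3, "capex"))
# [-30000.0, 0.0, 0.0, 0.0]
print(license_schedule(10_000.0, 3, "subscription"))
# [-10000.0, -10000.0, -10000.0, 0.0]
```

These match the F5 License columns in the $10,000/year examples above.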
Break-Even Analysis
- 70B + RAG workload: ~25% ROI (break-even threshold)
- 70B + Basic workload: Negative ROI (not recommended)
- 175B + Basic workload: ~51% ROI (viable even for simple workloads)
- Small models (<32B): Not recommended for F5 deployment
Token-Based Pricing Models (Models 3 & 4)
Model 3: Per-Token Pricing
F5 charges a fraction of every token processed through its DPU infrastructure. This aligns F5's revenue directly with the customer's usage volume.
Customer Annual Revenue = Total_Tokens/sec ร 3600 ร Active_Hours/yr ร Customer_Price/Token
Value Gap = Customer Revenue − F5 Revenue (what customer keeps)
Where: Total_Tokens/sec includes the F5-enhanced throughput across the entire GPU fleet. Active_Hours = Operational Hours ร Token Utilization %. Default F5 price ($0.015/1M tokens) represents ~1% of a typical blended market rate.
Model 4: Incremental Token Pricing
F5 charges only for the additional tokens/sec enabled by DPU deployment: the throughput uplift beyond baseline. This is the purest "pay for value" model.
F5 Annual Revenue = Incremental_Tokens/sec × 3600 × Active_Hours/yr × F5_Price/Token
Customer Incr. Revenue = Incremental_Tokens/sec × 3600 × Active_Hours/yr × Customer_Price/Token
Key insight: This model directly ties F5's revenue to the measurable throughput improvement. If F5 DPUs provide a 30% throughput boost, F5 charges on that 30% incremental capacity only.
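Both pricing models reduce to the same revenue formula applied to different throughput bases: total fleet tokens/sec for Model 3, uplift-only tokens/sec for Model 4. A sketch with illustrative inputs (the 100K tok/s baseline, 30% uplift, and 7,000 active hours are made-up numbers; $0.015/1M tokens is the default F5 rate above):

```python
SECONDS_PER_HOUR = 3600

def annual_token_revenue(tokens_per_sec: float, active_hours: float,
                         price_per_token: float) -> float:
    """Revenue = tokens/sec x 3600 x active hours/yr x price/token."""
    return tokens_per_sec * SECONDS_PER_HOUR * active_hours * price_per_token

# Illustrative inputs, not calculator defaults:
baseline_tps = 100_000
incremental_tps = 30_000                      # 30% uplift, charged under Model 4
total_tps = baseline_tps + incremental_tps    # charged under Model 3
f5_rate = 0.015 / 1_000_000                   # $0.015 per 1M tokens

print(round(annual_token_revenue(total_tps, 7_000, f5_rate)))        # 49140
print(round(annual_token_revenue(incremental_tps, 7_000, f5_rate)))  # 11340
```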
Market Token Pricing Reference (March 2026)
| Model | Price (per 1M tokens) |
|---|---|
| GPT-4o | $2.50 in / $10.00 out |
| GPT-5.2 | $1.75 in / $14.00 out |
| Claude Sonnet 4.6 | $3.00 in / $15.00 out |
| Claude Opus 4.6 | $5.00 in / $25.00 out |
| Gemini 2.5 Pro | $1.25 in / $10.00 out |
| Llama 4 Maverick | $0.27 in / $0.85 out |
| Llama 70B (Groq) | $0.59 in / $0.79 out |
| Llama 70B (Together) | $0.90 blended |
| Claude Haiku 4.5 | $1.00 in / $5.00 out |
| Gemini Flash | $0.50 in / $3.00 out |
Prices are blended input/output rates sourced from provider APIs as of March 2026. F5's default rate ($0.015/1M tokens) is set at ~1% of a typical customer blended rate, representing the infrastructure-layer value capture.
Data Sources & References
GPU Hardware Specifications
- NVIDIA H100/H200: NVIDIA H100 Tensor Core GPU - 80GB HBM3, 3.35 TB/s, 700W TDP
- NVIDIA B100/B200: NVIDIA Blackwell Architecture - 192GB HBM3e, 8 TB/s bandwidth
- NVIDIA GB200/GB300: NVIDIA Grace Blackwell Superchip - GTC 2024 announcement
- NVIDIA Vera Rubin (R200): NVIDIA Rubin Platform - 288GB HBM4, 22 TB/s, 1800W TDP (2H 2026)
- NVIDIA Rubin NVL72: Vera Rubin NVL72 - 5ร inference perf, 10ร lower cost/token vs Blackwell
- Memory Bandwidth: NVIDIA Hopper Architecture Deep Dive
- Tensor Core Generations: NVIDIA Ampere Whitepaper (PDF)
GPU Pricing & Market Data
- H100 SXM5 ($25-40K): Tom's Hardware H100 Pricing Analysis
- Cloud GPU Hourly Rates: CoreWeave GPU Cloud Pricing - H100: $2.49-4.25/hr
- Hyperscaler Rates: AWS P5 Instances | Azure ND Series
- GPU Market Analysis: SemiAnalysis - Industry GPU economics reports
Neocloud Economics & Financials
- IREN (Iris Energy): SEC EDGAR - IREN Filings - S-1/10-K for GPU cloud economics
- CoreWeave: CoreWeave Blog - Infrastructure economics, investor presentations
- Nebius: Nebius Investor Relations - GPU cloud unit economics
- Lambda Labs: Lambda GPU Cloud - Pricing benchmarks
Token Pricing & Inference Benchmarks
- OpenAI Pricing: OpenAI API Pricing - GPT-4o, GPT-4 Turbo rates
- Anthropic Pricing: Anthropic API Pricing - Claude 3.5/4 rates
- Together.ai: Together AI Pricing - Open model inference costs
- Fireworks.ai: Fireworks Pricing - Serverless inference rates
- MLPerf Inference: MLCommons Inference Benchmarks
Market Token Pricing Reference (Dec 2025)
Reference rates from major inference providers. The calculator uses the Blended Average column by default. You can override with a custom rate in the sidebar.
Note: Prices shown are per 1M output tokens (input typically 50-80% cheaper). Calculator defaults are set higher than spot rates to reflect enterprise SLA pricing, burst capacity premiums, and self-hosted margin targets. DeepSeek-V3 pricing is anomalously low due to MoE efficiency; adjust upward for non-MoE frontier models.
DPU Technology & Infrastructure
- F5 BIG-IP Next: F5 BIG-IP Product Page
- NVIDIA BlueField-3: NVIDIA DPU Overview
- AMD Pensando: AMD Pensando DPU
- Data Center PUE: Google Data Center Efficiency - Industry PUE benchmarks (1.1-1.5)
Model Architecture References
- Llama 3.1 405B: Meta AI - Llama 3.1 Release
- DeepSeek-V3 671B: DeepSeek-V3 Technical Report
- MoE Architectures: Mixtral of Experts (arXiv)
- KV Cache Optimization: PagedAttention (vLLM)
- Disaggregated Serving: Splitwise: Disaggregated LLM Serving
GPU-Model Capacity Table Notes
The 1×/2×/4×/8× GPU columns show per-server tensor parallelism requirements:
- 1ร GPU: Single GPU inference (no parallelism needed)
- 2ร/4ร GPU: Tensor parallelism within a single multi-GPU node
- 8ร GPU: Full 8-GPU server (e.g., DGX H100) with NVLink/NVSwitch
- Memory calculation: FP16 weights (~2 bytes/param) + 65% KV cache utilization
- Multi-node: For models exceeding 8ร GPU capacity, requires NVSwitch fabric or pipeline parallelism
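The memory arithmetic above can be sketched as a weight-footprint estimate, with whatever VRAM remains left over for KV cache. This is one plausible reading of the methodology; the calculator's exact headroom accounting is not published:

```python
def fp16_weight_gb(params_b: float) -> float:
    """Approximate FP16 weight footprint: ~2 bytes per parameter."""
    return params_b * 2

def kv_headroom_gb(vram_gb: float, gpus: int, params_b: float) -> float:
    """VRAM left for KV cache across a tensor-parallel group after weights."""
    return vram_gb * gpus - fp16_weight_gb(params_b)

# 70B on 2x H100-80 (a capable pairing per the table): 140 GB of weights,
# leaving 20 GB of the 160 GB pool for KV cache.
print(fp16_weight_gb(70))          # 140
print(kv_headroom_gb(80, 2, 70))   # 20
```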
Last Updated: December 2025 | Data sources verified as of publication date. GPU pricing and cloud rates subject to market conditions.
This calculator provides estimates for planning purposes only. Actual ROI will vary based on specific workloads, infrastructure configurations, market conditions, and operational factors. Consult with F5 sales engineering for detailed assessments.
Appendix
Detailed NPV Calculation (256 GPUs)
* Discount rate: 12% | License term: 3 years | Discount factors: 1.12, 1.25, 1.40 (= 1.12^t, rounded)
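The stated discount factors are 1.12^t for t = 1..3. A sketch of the NPV computation with hypothetical cash flows (the $485K Year 0 figure mirrors the add-on hardware example above; the $300K/yr net benefit is purely illustrative):

```python
def npv(net_benefits: list[float], year0_outflow: float,
        rate: float = 0.12) -> float:
    """Discount each year's net benefit at `rate`, subtract the Year 0 outflow."""
    pv = sum(b / (1 + rate) ** t for t, b in enumerate(net_benefits, start=1))
    return pv - year0_outflow

# Hypothetical: $485K Year 0 outflow, $300K/yr net benefit over a 3-year term
print(round(npv([300_000.0] * 3, 485_000.0)))  # 235549
```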