Comprehensive performance analysis of Phala Cloud’s TEE infrastructure.

Component | Environment | Key Metric | Measured Value | Critical Advantage |
---|---|---|---|---|
GPU Acceleration | TEE + H200 | INT8 Compute | 3341 TOPS | 6.9x vs AWS GPUs |
CPU Efficiency | TDX-Lab | Per-core Compression | 1,271 MIPS | 48% higher than production systems |
CPU Scaling Efficiency | TEE CVM | Scaling efficiency compared to single core | 79.91% | 10% higher than bare metal servers |
Disk I/O | TEE VM | Sequential Read | 36.2 GB/s | Matches bare-metal performance |
Network Gateway | TDX-Lab | QPS @1000cc | 1,244.85 | 594% higher than production systems |
Security Overhead | zkTLS | Verification Speed | 18.3 kHz | 5.12x slowdown mitigated at scale |

Metric | H200 Host | H200 CVM | TDX-Lab Host | TDX-Lab CVM | TDX-Lab VM |
---|---|---|---|---|---|
Random R/W IOPS (k) | 259/156 | 33/34 | 97.3/32 | 43/37 | 72/75 |
Random R/W Bandwidth (MiB/s) | 3642/3356 | 2071/1389 | 519/381 | 1080/1193 | 12253/4215 |
Avg Latency R/W (μs) | 85/24 | 892/955 | 213/99 | 482/190 | 63/68 |
Sequential R/W (MiB/s) | 4338/3413 | 9973/2293 | 526/377 | 1168/412 | 36230/3829 |
Mixed R/W IOPS (k) | 129/43 | 18.7/6.2 | 67.7/23.5 | 43.6/14.5 | 71/23 |

Metric | Concurrency | Prod5 | TDX-Lab | Gain | Latency Advantage
---|---|---|---|---|---
QPS | 200 | 206.46 | 1189.93 | +476% | 5.76x
QPS | 500 | 194.70 | 1228.82 | +531% | 6.31x
QPS | 1000 | 179.33 | 1244.85 | +594% | 6.94x
QPS | 2000 | 182.39 | 1038.60 | +469% | 5.70x
P99 Latency (ms) | 200 | 1,254 | 173 | -86% | 7.25x
P99 Latency (ms) | 500 | 4,099 | 455 | -89% | 9.01x
P99 Latency (ms) | 1000 | 7,768 | 822 | -89% | 9.45x
P99 Latency (ms) | 2000 | 19,950 | 1,672 | -92% | 11.93x
Error Rate | 2000 | 1.36% | 0% | -100% | Absolute advantage
Max Connect Time (ms) | 2000 | 27,084 | 1,677 | -94% | 16.15x
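
The Gain and Latency Advantage columns in the gateway table can be recomputed from the raw Prod5 and TDX-Lab figures. The following is a minimal sketch, assuming Gain is the signed relative change against Prod5 and the advantage is the plain ratio between the two environments. The function and variable names are illustrative and not part of any Phala tooling.

```python
# Values copied from the gateway table: concurrency -> (Prod5, TDX-Lab).
QPS = {200: (206.46, 1189.93), 500: (194.70, 1228.82),
       1000: (179.33, 1244.85), 2000: (182.39, 1038.60)}
P99_MS = {200: (1254, 173), 500: (4099, 455),
          1000: (7768, 822), 2000: (19950, 1672)}


def percent_change(baseline: float, value: float) -> float:
    """Signed relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def qps_row(concurrency: int) -> tuple[float, float]:
    """Return (gain in percent, throughput ratio); higher QPS is better."""
    prod, lab = QPS[concurrency]
    return percent_change(prod, lab), lab / prod


def p99_row(concurrency: int) -> tuple[float, float]:
    """Return (change in percent, latency ratio); lower latency is better."""
    prod, lab = P99_MS[concurrency]
    return percent_change(prod, lab), prod / lab


if __name__ == "__main__":
    for c in sorted(QPS):
        gain, ratio = qps_row(c)
        print(f"QPS @{c}: {gain:+.0f}% ({ratio:.2f}x)")
    for c in sorted(P99_MS):
        change, ratio = p99_row(c)
        print(f"P99 @{c}: {change:+.0f}% ({ratio:.2f}x)")
```

For the QPS rows the ratio is TDX-Lab over Prod5, while for the latency rows it is Prod5 over TDX-Lab, so a value above 1 always favors TDX-Lab.
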

Access Method | Environment | QPS | Avg Latency | Overhead Source |
---|---|---|---|---|
Direct (HTTP) | TDX-Lab | 11,132 | 89.8ms | Baseline |
TProxy (HTTPS) | TDX-Lab | 1,264 | 791.3ms | +781% latency |

Environment | Gateway Ver. | Log Level | Total QPS | CPU Usage | Bottleneck |
---|---|---|---|---|---|
TDX-Lab | 7bc9eea | error | 15,000 | 94.5% | CPU saturation |
Prod8 | 7bc9eea | error | 22,400 | 19.2% | Network interrupt |
Prod8 | 7bc9eea | info | 15,000 | 14.7% | Log I/O blocking |

Environment | Total Time (s) | Proof Time (s) | Speed (Hz) | Proof Size (bytes) | TEE Overhead |
---|---|---|---|---|---|
CPU | 166.54 | 166.30 | 98,479 | 8,340,752 | Baseline |
TEE CPU | 628.78 | 628.58 | 24,704 | 25,123,323 | 3.78x |
TEE GPU | 852.98 | 852.73 | 18,312 | 25,123,323 | 5.12x |
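
The TEE Overhead column matches the ratio of total proving time to the plain CPU run. The following is a minimal sketch under that assumption; the dictionary and function names are illustrative.

```python
# Total proving time in seconds, copied from the table.
TOTAL_TIME_S = {"CPU": 166.54, "TEE CPU": 628.78, "TEE GPU": 852.98}


def tee_overhead(environment: str, baseline: str = "CPU") -> float:
    """Slowdown factor of `environment` relative to `baseline` total time."""
    return TOTAL_TIME_S[environment] / TOTAL_TIME_S[baseline]


if __name__ == "__main__":
    for env in TOTAL_TIME_S:
        print(f"{env}: {tee_overhead(env):.2f}x")
```
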

Environment | Hash Speed (MB/s) | Memory BW Utilization |
---|---|---|
CPU | 44.50 | 100% |
TEE CPU | 11.77 | 26.4% |
TEE GPU | 8.68 | 19.5% |

System Config | Compress MIPS | Cores | Per-core Efficiency |
---|---|---|---|
TDX-Lab | 40,677 | 32 | 1,271 |
Prod8 | 31,658 | 288 | 110 |
Prod5 | 12,654 | 128 | 99 |
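
Per-core efficiency in this table is consistent with total compression MIPS divided by core count. The following is a small sketch of that calculation, with illustrative names:

```python
# System -> (total compression MIPS, core count), copied from the table.
SYSTEMS = {
    "TDX-Lab": (40_677, 32),
    "Prod8": (31_658, 288),
    "Prod5": (12_654, 128),
}


def per_core_mips(system: str) -> float:
    """Compression MIPS delivered per core for the named system."""
    total_mips, cores = SYSTEMS[system]
    return total_mips / cores


if __name__ == "__main__":
    for name in SYSTEMS:
        print(f"{name}: {per_core_mips(name):.0f} MIPS per core")
```
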

Metric | H200 NVL | AWS G6 (L4) | Advantage |
---|---|---|---|
Memory Capacity | 141GB | 24GB | 5.9x |
Memory Bandwidth | 4.8TB/s | 300GB/s | 16x |
INT8 Compute | 3341 TOPS | 485 TOPS | 6.9x |
Hourly Cost | $2.5 | $0.805 | 3.1x |
Cost-Performance Ratio | 1.0 | 0.45 | 2.2x |
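
The Cost-Performance Ratio row can be reproduced from the INT8 compute and hourly cost rows, reading it as compute per dollar normalized to the H200. The following is a minimal sketch under that reading; the dictionary keys and function names are illustrative.

```python
# INT8 compute and hourly price as listed in the table.
H200_NVL = {"int8_compute": 3341.0, "usd_per_hour": 2.5}
AWS_G6_L4 = {"int8_compute": 485.0, "usd_per_hour": 0.805}


def compute_per_dollar(gpu: dict) -> float:
    """INT8 compute obtained per dollar of hourly spend."""
    return gpu["int8_compute"] / gpu["usd_per_hour"]


def cost_performance_ratio(gpu: dict, baseline: dict = H200_NVL) -> float:
    """Compute per dollar, normalized so the baseline GPU scores 1.0."""
    return compute_per_dollar(gpu) / compute_per_dollar(baseline)


if __name__ == "__main__":
    l4 = cost_performance_ratio(AWS_G6_L4)
    print(f"H200 NVL: {cost_performance_ratio(H200_NVL):.2f}")
    print(f"AWS G6 (L4): {l4:.2f}")
    print(f"H200 advantage: {1 / l4:.1f}x")
```
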
Cost-Performance Ratio = (Compute / Cost) / H200 Baseline

Integration Benefits


GPU | Model | TPS (TEE-on) | TPS (TEE-off) | TPS Overhead | QPS (TEE-on) | QPS (TEE-off) | QPS Overhead
---|---|---|---|---|---|---|---
H100 | Llama-3.1-8B | 123.30 | 132.36 | +6.85% | 18.21 | 18.82 | +3.22%
H100 | Phi3-14B-128k | 66.58 | 69.78 | +4.58% | 7.18 | 7.35 | +2.31%
H100 | Llama-3.1-70B | 2.48 | 2.48 | -0.13% | 0.83 | 0.83 | -0.36%
H200 | Llama-3.1-8B | 121.04 | 132.78 | +8.84% | 29.60 | 32.01 | +7.55%
H200 | Phi3-14B-128k | 68.43 | 72.98 | +6.24% | 12.83 | 13.86 | +7.41%
H200 | Llama-3.1-70B | 4.08 | 4.18 | +2.29% | 2.19 | 2.20 | +0.63%

GPU | Model | TTFT (TEE-on) | TTFT (TEE-off) | TTFT Overhead | ITL (TEE-on) | ITL (TEE-off) | ITL Overhead
---|---|---|---|---|---|---|---
H100 | Llama-3.1-8B | 0.0288 | 0.0242 | +19.03% | 1.6743 | 1.5549 | +7.67%
H100 | Phi3-14B-128k | 0.0546 | 0.0463 | +18.02% | 3.7676 | 3.5784 | +5.29%
H100 | Llama-3.1-70B | 0.5108 | 0.5129 | -0.41% | 94.8714 | 95.2395 | -0.39%
H200 | Llama-3.1-8B | 0.0364 | 0.0301 | +20.95% | 1.7158 | 1.5552 | +10.33%
H200 | Phi3-14B-128k | 0.0524 | 0.0417 | +25.60% | 3.6975 | 3.4599 | +6.87%
H200 | Llama-3.1-70B | 0.4362 | 0.4204 | +3.75% | 57.3855 | 55.9771 | +2.52%
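
The overhead columns in the throughput and latency tables appear to use opposite sign conventions so that a positive value always means the TEE-on run is worse. The following is a hedged sketch of both formulas, assuming overhead is measured relative to the TEE-off run. Because the tables show rounded inputs, recomputed values can differ from the printed ones in the last digit.

```python
def throughput_overhead(tee_on: float, tee_off: float) -> float:
    """Percent of throughput (TPS or QPS) lost when the TEE is enabled."""
    return (tee_off - tee_on) / tee_off * 100.0


def latency_overhead(tee_on: float, tee_off: float) -> float:
    """Percent of extra latency (TTFT or ITL) added when the TEE is enabled."""
    return (tee_on - tee_off) / tee_off * 100.0


if __name__ == "__main__":
    # H100, Llama-3.1-8B rows from the two tables.
    print(f"TPS overhead:  {throughput_overhead(123.30, 132.36):+.2f}%")
    print(f"TTFT overhead: {latency_overhead(0.0288, 0.0242):+.2f}%")
    print(f"ITL overhead:  {latency_overhead(1.6743, 1.5549):+.2f}%")
```
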

Model | GPU Time Dominance | TEE Overhead |
---|---|---|
Llama-3.1-70B | 99.2% | < 0.5% |
Phi3-14B-128k | 95.7% | 2-7% |
Llama-3.1-8B | 88.3% | 7-26% |

Threads | CPU Usage (%) | Compression (KiB/s) | Decompression (KiB/s) | Total Rating | Scaling Efficiency |
---|---|---|---|---|---|
1 | 100 | 2,511 | 34,387 | 2,803 | 100.00% |
2 | 200 | 9,450 | 68,316 | 7,889 | 140.75% |
4 | 383 | 14,048 | 135,014 | 13,100 | 116.90% |
8 | 760 | 27,361 | 268,126 | 25,821 | 115.17% |
16 | 1,432 | 44,535 | 443,829 | 42,322 | 94.38% |
32 | 2,811 | 72,284 | 783,537 | 71,665 | 79.91% |

Threads | CPU Usage (%) | Compression (KiB/s) | Decompression (KiB/s) | Total Rating | Scaling Efficiency |
---|---|---|---|---|---|
1 | 100 | 4,992 | 43,200 | 4,544 | 100% (baseline) |
2 | 183 | 9,791 | 63,936 | 7,935 | 87.3% |
4 | 382 | 18,895 | 109,687 | 14,689 | 80.9% |
8 | 787 | 34,001 | 217,465 | 27,279 | 75.1% |
16 | 1,559 | 71,892 | 437,145 | 56,853 | 78.1% |
32 | 2,662 | 121,055 | 797,133 | 97,853 | 67.4% |
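
Scaling Efficiency in the two thread-scaling tables is consistent with the N-thread Total Rating divided by N times the single-thread rating. The following is a minimal sketch under that assumption. It also assumes the first table is the TEE CVM run, since its 32-thread value matches the 79.91% quoted in the summary, and that the second is the bare-metal comparison. The names are illustrative, and recomputed values agree with the tables to within about 0.1 percentage points.

```python
# Threads -> Total Rating, copied from the two scaling tables.
TEE_CVM_RATING = {1: 2_803, 2: 7_889, 4: 13_100,
                  8: 25_821, 16: 42_322, 32: 71_665}
BARE_METAL_RATING = {1: 4_544, 2: 7_935, 4: 14_689,
                     8: 27_279, 16: 56_853, 32: 97_853}


def scaling_efficiency(ratings: dict[int, int], threads: int) -> float:
    """Percent of ideal linear scaling achieved at `threads` threads."""
    ideal = threads * ratings[1]  # perfect scaling from the 1-thread rating
    return ratings[threads] / ideal * 100.0


if __name__ == "__main__":
    for n in sorted(TEE_CVM_RATING):
        tee = scaling_efficiency(TEE_CVM_RATING, n)
        bare = scaling_efficiency(BARE_METAL_RATING, n)
        print(f"{n:>2} threads: TEE {tee:6.2f}%  bare metal {bare:6.2f}%")
```
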

Thread Range | Efficiency Delta | Primary Benefit |
---|---|---|
1-8 threads | +35.2% avg | Cache locality optimization |
16-32 threads | +12.1% avg | Reduced context-switching |

Metric | TEE Advantage | Technical Cause |
---|---|---|
Initial Scaling | +52.45% (2-thread) | Memory-bound workload optimization |
Mid-range | +41.8% (4-thread) | Reduced hypervisor interference |
High-core | +12.5% (32-thread) | NUMA-aware scheduling |