
🏎️ GPU TEE Benchmark


The benchmark runs LLMs on NVIDIA H100 and H200 GPUs. Our results show that as input size grows, the efficiency of TEE mode increases significantly: once computation time within the GPU dominates overall processing time, the I/O overhead introduced by TEE mode diminishes, and efficiency approaches nearly 99%.

Efficiency gains are more pronounced in larger models, such as Phi3-14B-128k and Llama3.1-70B, because their greater computational demands result in longer GPU processing times, so the I/O overhead becomes increasingly trivial as model size grows. The total token count (the sum of input and output tokens) also strongly influences throughput overhead: larger totals yield higher efficiency because they raise the ratio of computation time to I/O time.

These findings underscore the scalability of TEE mode for large-scale LLM inference, particularly as input sizes and model complexity grow. The minimal overhead in compute-heavy scenarios validates its applicability to secure, high-performance AI workloads.
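
To make the efficiency figures above concrete, here is a minimal sketch of how TEE efficiency and overhead can be derived from measured throughput. It assumes efficiency is defined as the ratio of TEE-mode throughput to native (non-TEE) throughput; the function name and all sample numbers are illustrative, not values taken from the benchmark.

```python
# Minimal sketch: deriving TEE efficiency/overhead from measured throughput.
# Assumption: efficiency = TEE throughput / native (non-TEE) throughput.
# All numbers below are illustrative only, not measurements from the paper.

def tee_efficiency(tps_tee: float, tps_native: float) -> float:
    """Ratio of TEE-mode throughput to native throughput (0.0-1.0)."""
    return tps_tee / tps_native

# Hypothetical runs: (total tokens = input + output, native tok/s, TEE tok/s)
runs = [
    (256,  1800.0, 1450.0),   # short sequence: CPU<->GPU I/O overhead dominates
    (2048, 1500.0, 1380.0),
    (8192, 1200.0, 1188.0),   # compute-bound: overhead nearly vanishes (~99%)
]

for total_tokens, tps_native, tps_tee in runs:
    eff = tee_efficiency(tps_tee, tps_native)
    print(f"{total_tokens:>5} tokens: efficiency {eff:.1%}, overhead {1 - eff:.1%}")
```

As the sketch illustrates, the longer the sequence (and hence the larger the share of time spent in GPU computation), the closer the efficiency ratio gets to 1.0.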

For more details on the metrics and analysis, check the benchmark paper we published earlier.