
AdaptLLM™ Inference

AdaptLLM™ is Adaptensor's TPU-accelerated large language model inference engine. It powers document understanding, semantic search, and natural language queries across your private data.

Why TPUs for LLM Inference?

Factor                NVIDIA GPUs          Google TPUs
Raw FLOPS (BF16)      312 TFLOPS (H100)    275 TFLOPS (v5p)
Cost per hour         $8-12                $4-6
Memory bandwidth      3.35 TB/s            4.8 TB/s
Availability          Constrained          Good
Cost efficiency       Baseline             2-3x better

TPUs excel at the dense matrix multiplications that dominate transformer inference.

AdaptLLM Architecture

┌─────────────────────────────────────────────────────────────┐
│                     AdaptLLM™ Engine                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │  Embedding  │  │  Retrieval  │  │  Generation │        │
│  │   Model     │  │   Ranker    │  │    (LLM)    │        │
│  │  (384-dim)  │  │  (Cross-    │  │  (7B-70B)   │        │
│  │             │  │   encoder)  │  │             │        │
│  └─────────────┘  └─────────────┘  └─────────────┘        │
│         │                │                │                │
│         └────────────────┼────────────────┘                │
│                          │                                  │
│                    AdaptCore™                               │
│              (Bucketing, Adapters, Early Exit)              │
│                          │                                  │
│                    ┌─────┴─────┐                           │
│                    │  TPU v2-8 │                           │
│                    │  180 TFLOPS│                           │
│                    └───────────┘                           │
└─────────────────────────────────────────────────────────────┘
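The Bucketing step in AdaptCore™ exists because TPUs compile programs for fixed tensor shapes: padding every request up to one of a handful of preset sequence lengths lets compiled kernels be reused instead of triggering a recompilation for each unique length. A minimal sketch of the idea in Python (the bucket sizes and helper below are illustrative, not the actual AdaptCore API):

import numpy as np

# Illustrative bucket sizes; the real AdaptCore buckets are not documented here.
BUCKETS = (128, 256, 512, 1024)

def pad_to_bucket(token_ids: list[int], pad_id: int = 0) -> np.ndarray:
    """Pad a token sequence to the smallest bucket that fits it."""
    length = len(token_ids)
    candidates = [b for b in BUCKETS if b >= length]
    if not candidates:
        raise ValueError(f"sequence of length {length} exceeds the largest bucket")
    padded = np.full(candidates[0], pad_id, dtype=np.int32)
    padded[:length] = token_ids
    return padded

print(pad_to_bucket(list(range(200))).shape)  # (256,): reuses the 256-token program

Restricting requests to a few fixed shapes trades a little wasted padding for far fewer compilations, which keeps tail latency predictable on TPUs.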

Processing Pipeline

1. Document Ingestion

When you upload documents, AdaptLLM:

Document → Chunking → Embedding → AdaptHex™ → Storage
   │          │           │           │          │
   │          │           │           │          └─ gs://adaptensor-indexes/{uid}/
   │          │           │           └─ 4-8x compression
   │          │           └─ 384-dim vectors on TPU
   │          └─ Smart paragraph/section splitting
   └─ PDF, DOCX, TXT, HTML, MD

Performance:

  • 10x faster than GPU-based embedding
  • 762 chunks indexed in 170 seconds (our aviation test)
  • ~4.5 chunks/second sustained throughput
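As a rough, local illustration of the Chunking → Embedding stages: the snippet below uses the same open embedder listed in the Model Stack section, but runs it directly via sentence-transformers on CPU/GPU rather than through AdaptCore™ on TPU, and the paragraph splitter is a simplified stand-in for the real chunker ("manual.txt" is just a placeholder input file):

from sentence_transformers import SentenceTransformer

# Same embedder as in the Model Stack section; here it runs locally, not on TPU.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk_document(text: str, max_chars: int = 1200) -> list[str]:
    """Greatly simplified stand-in for the smart paragraph/section splitting step."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_document(open("manual.txt").read())
vectors = embedder.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, 384)

Normalizing the embeddings at indexing time is what lets the query stage below score candidates with a plain dot product.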

2. Query Processing

When you query:

Query → Embed → Search → Rank → Return
  │       │       │        │       │
  │       │       │        │       └─ Top-k results with scores
  │       │       │        └─ Cross-encoder reranking
  │       │       └─ Cosine similarity on AdaptHex vectors
  │       └─ Same embedding model as indexing
  └─ Natural language question

Performance:

  • 18ms average query latency
  • Sub-second even on million-document indexes
  • Accuracy: 99.6% vs uncompressed (AdaptHex)
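Continuing the sketch above (it reuses embedder, chunks, and vectors from the ingestion snippet): with unit-normalized vectors, cosine similarity reduces to a dot product, so a brute-force top-k search is a single matrix-vector product. The real engine performs this over AdaptHex™-compressed vectors on TPU; this in-memory numpy version only illustrates the scoring logic:

import numpy as np

def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 10) -> list[tuple[int, float]]:
    """Cosine similarity equals the dot product when both sides are unit-normalized."""
    scores = index @ query_vec                  # (n_chunks,)
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]

query_vec = embedder.encode("What are the termination clauses?", normalize_embeddings=True)
for idx, score in top_k(query_vec, vectors, k=5):
    print(f"{score:.3f}  {chunks[idx][:80]}")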

3. RAG (Retrieval-Augmented Generation)

For question-answering over your documents:

Query → Retrieve (top 5) → Construct Prompt → Generate → Answer
                    ┌─────────────────────────────┐
                    │ System: You are a helpful   │
                    │ assistant. Use ONLY the     │
                    │ provided context.           │
                    │                             │
                    │ Context:                    │
                    │ [Retrieved chunks...]       │
                    │                             │
                    │ Question: {user query}      │
                    └─────────────────────────────┘
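In code, assembling the grounded prompt sketched above is a simple template fill; the exact wording of the system instruction is illustrative, and the example continues the running snippets from the previous sections:

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the context-grounded prompt shown in the diagram."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "System: You are a helpful assistant. Use ONLY the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What are the termination clauses?",
    [chunks[idx] for idx, _ in top_k(query_vec, vectors, k=5)],
)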

Model Stack

Component   Model                    Parameters   Purpose
Embedder    all-MiniLM-L6-v2         22M          Fast semantic vectors
Reranker    cross-encoder/ms-marco   110M         Precision ranking
Generator   Llama-3-8B (quantized)   8B           Answer generation

All models run on TPU via AdaptCore™ middleware.
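The reranking stage can be reproduced locally with an open checkpoint from the cross-encoder/ms-marco family named above (the exact variant below is an assumption; the table lists only the family). A cross-encoder scores each (query, passage) pair jointly, which is slower than the bi-encoder search but considerably more precise, so it is applied only to a short candidate list:

from sentence_transformers import CrossEncoder

# Checkpoint choice is an assumption; the Model Stack names only "cross-encoder/ms-marco".
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What are the termination clauses?"
candidates = [chunks[idx] for idx, _ in top_k(query_vec, vectors, k=20)]
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:5]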

Multi-Tenant Inference

AdaptLLM handles multiple users on shared TPU infrastructure:

User A Query ─┐
              │     ┌──────────────┐     ┌─────────────┐
User B Query ─┼────▶│  Job Queue   │────▶│  TPU Pool   │
              │     │ (Cloud Tasks)│     │  (Shared)   │
User C Query ─┘     └──────────────┘     └─────────────┘
                    ┌───────────────────────────┼───────────────────────────┐
                    │                           │                           │
                    ▼                           ▼                           ▼
            ┌───────────────┐           ┌───────────────┐           ┌───────────────┐
            │ User A's Data │           │ User B's Data │           │ User C's Data │
            │ (Isolated)    │           │ (Isolated)    │           │ (Isolated)    │
            └───────────────┘           └───────────────┘           └───────────────┘

Isolation guarantees:

  • Each query only accesses that user's indexes
  • No cross-tenant data leakage
  • Separate billing per user
  • Optional per-user adapter weights
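One simple way to picture the isolation is at the storage layer: every index operation is rooted under the per-user prefix shown in the ingestion diagram, so a worker pulling jobs from the shared queue can only dereference paths derived from the authenticated user ID. The helper and job payload below are hypothetical, not the actual AdaptLLM internals:

GCS_INDEX_ROOT = "gs://adaptensor-indexes"  # per-user prefix from the ingestion diagram

def index_path(uid: str, index_name: str) -> str:
    """All reads and writes resolve under the requesting user's prefix, so a query
    for one tenant can never address another tenant's index."""
    if not uid or "/" in uid:
        raise ValueError("invalid user id")
    return f"{GCS_INDEX_ROOT}/{uid}/{index_name}"

# A queued job carries the authenticated uid; workers only build paths via index_path().
job = {"uid": "user_a", "index": "contracts", "query": "termination clauses"}
print(index_path(job["uid"], job["index"]))  # gs://adaptensor-indexes/user_a/contracts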

Performance Benchmarks

Metric                   Adaptensor (TPU)   GPU Alternatives
Embedding throughput     1,000 chunks/min   100 chunks/min
Query latency (p50)      18ms               50-100ms
Query latency (p99)      45ms               200-500ms
Cost per 1M embeddings   $0.10              $0.50-1.00
Cold start time          30s                60-120s

Configuration Options

from adaptensor import AdaptensorClient

client = AdaptensorClient(api_key="sk_live_...")

# Configure inference behavior
response = client.query(
    index_name="contracts",
    query="What are the termination clauses?",
    options={
        "model": "default",           # or "fast", "accurate"
        "top_k": 10,                  # Results to retrieve
        "rerank": True,               # Use cross-encoder
        "generate_answer": True,      # RAG mode
        "max_tokens": 500,            # Answer length limit
        "temperature": 0.1,           # Lower = more focused
    }
)
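For search-only workloads the generation stage can be skipped, which avoids the LLM call entirely and keeps latency in the retrieval range; this example reuses only the parameters shown above:

# Retrieval-only query: skip answer generation for lower latency.
results = client.query(
    index_name="contracts",
    query="indemnification obligations",
    options={"top_k": 5, "rerank": True, "generate_answer": False},
)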

Security & Privacy

AdaptLLM maintains strict data isolation:

  • ✅ Your documents never leave your cloud storage
  • ✅ No training on customer data
  • ✅ No external API calls (OpenAI, Anthropic, etc.)
  • ✅ SOC2, HIPAA, GDPR compliance ready
  • ✅ Full audit logging

Next Steps