# AdaptLLM™ Inference
AdaptLLM™ is Adaptensor's TPU-accelerated large language model inference engine. It powers document understanding, semantic search, and natural language queries across your private data.
## Why TPUs for LLM Inference?
| Factor | NVIDIA GPUs | Google TPUs |
|---|---|---|
| Raw FLOPS (BF16) | 312 TFLOPS (H100) | 275 TFLOPS (v5p) |
| Cost per hour | $8-12 | $4-6 |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s |
| Availability | Constrained | Good |
| Cost efficiency | Baseline | 2-3x better |
TPUs excel at the dense matrix multiplications that dominate transformer inference.
## AdaptLLM Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ AdaptLLM™ Engine │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Embedding │ │ Retrieval │ │ Generation │ │
│ │ Model │ │ Ranker │ │ (LLM) │ │
│ │ (384-dim) │ │ (Cross- │ │ (7B-70B) │ │
│ │ │ │ encoder) │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ │ │
│ AdaptCore™ │
│ (Bucketing, Adapters, Early Exit) │
│ │ │
│ ┌─────┴─────┐ │
│ │ TPU v2-8 │ │
│ │ 180 TFLOPS│ │
│ └───────────┘ │
└─────────────────────────────────────────────────────────────┘
```
## Processing Pipeline
### 1. Document Ingestion

When you upload documents, AdaptLLM runs them through the following pipeline:
```
Document → Chunking → Embedding → AdaptHex™ → Storage
│ │ │ │ │
│ │ │ │ └─ gs://adaptensor-indexes/{uid}/
│ │ │ └─ 4-8x compression
│ │ └─ 384-dim vectors on TPU
│ └─ Smart paragraph/section splitting
└─ PDF, DOCX, TXT, HTML, MD
```

Performance:

- 10x faster than GPU-based embedding
- 762 chunks indexed in 170 seconds (our aviation test)
- ~4.5 chunks/second sustained throughput
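Client-side, an upload is a single call; the engine handles the chunking, embedding, compression, and storage steps above. A minimal sketch, assuming a hypothetical `upload_document` method on the same `AdaptensorClient` shown under Configuration Options (the method name, arguments, and return value are illustrative, not the confirmed API):

```python
from adaptensor import AdaptensorClient

client = AdaptensorClient(api_key="sk_live_...")

# Hypothetical ingestion call: the engine performs chunking, 384-dim TPU
# embedding, AdaptHex compression (4-8x), then stores the index under
# gs://adaptensor-indexes/{uid}/.
job = client.upload_document(
    index_name="contracts",
    file_path="master_services_agreement.pdf",  # PDF, DOCX, TXT, HTML, MD supported
)

print(job)  # e.g. number of chunks indexed and job status
```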
### 2. Semantic Search
When you query:
```
Query → Embed → Search → Rank → Return
│ │ │ │ │
│ │ │ │ └─ Top-k results with scores
│ │ │ └─ Cross-encoder reranking
│ │ └─ Cosine similarity on AdaptHex vectors
│ └─ Same embedding model as indexing
└─ Natural language question
```

Performance:

- 18ms average query latency
- Sub-second even on million-document indexes
- Accuracy: 99.6% vs uncompressed (AdaptHex)
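The same options documented under Configuration Options can drive a search-only call. A minimal sketch, assuming that setting `generate_answer` to `False` returns the ranked chunks without invoking the generator (that behavior is inferred from the option names, not confirmed), and that the response shape includes per-chunk scores:

```python
from adaptensor import AdaptensorClient

client = AdaptensorClient(api_key="sk_live_...")

# Search-only query: embed the question, cosine-search the AdaptHex vectors,
# then rerank the candidates with the cross-encoder.
results = client.query(
    index_name="contracts",
    query="notice period for early termination",
    options={
        "top_k": 10,               # candidates to retrieve
        "rerank": True,            # cross-encoder precision ranking
        "generate_answer": False,  # assumed: skip the RAG generation step
    },
)

print(results)  # top-k chunks with similarity scores
```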
### 3. RAG (Retrieval-Augmented Generation)
For question-answering over your documents:
```
Query → Retrieve (top 5) → Construct Prompt → Generate → Answer
│
▼
┌─────────────────────────────┐
│ System: You are a helpful │
│ assistant. Use ONLY the │
│ provided context. │
│ │
│ Context: │
│ [Retrieved chunks...] │
│ │
│ Question: {user query} │
└─────────────────────────────┘
```
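The prompt-assembly step in the diagram is straightforward to reproduce. A minimal sketch, assuming the retrieved chunks arrive as plain strings; the exact template wording inside AdaptLLM may differ, and the sample chunks are illustrative:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the grounded prompt shown above: system rule, retrieved context, user question."""
    context = "\n\n".join(chunks)
    return (
        "System: You are a helpful assistant. Use ONLY the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# The top-5 retrieved chunks feed the generator (Llama-3-8B in the model stack below).
prompt = build_rag_prompt(
    "What are the termination clauses?",
    [
        "Either party may terminate this agreement with 30 days written notice...",
        "Termination for cause requires a 10-day cure period...",
    ],
)
print(prompt)
```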
## Model Stack
| Component | Model | Parameters | Purpose |
|---|---|---|---|
| Embedder | all-MiniLM-L6-v2 | 22M | Fast semantic vectors |
| Reranker | cross-encoder/ms-marco | 110M | Precision ranking |
| Generator | Llama-3-8B (quantized) | 8B | Answer generation |
All models run on TPU via AdaptCore™ middleware.
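For reference, the embed-and-rerank half of this stack can be exercised with the open-source `sentence-transformers` package on CPU/GPU; inside AdaptLLM the same models run on TPU through AdaptCore™, and the Llama-3-8B generation stage is omitted here for brevity. The specific ms-marco cross-encoder checkpoint below is an assumption, since the table names only the model family:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # 384-dim bi-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed ms-marco variant

chunks = [
    "The agreement may be terminated with 30 days written notice.",
    "Payment is due within 45 days of invoice receipt.",
]
query = "How can the contract be terminated?"

# Stage 1: bi-encoder retrieval by cosine similarity
chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)
query_vec = embedder.encode(query, convert_to_tensor=True)
cosine_scores = util.cos_sim(query_vec, chunk_vecs)[0]

# Stage 2: cross-encoder reranking of the retrieved candidates
rerank_scores = reranker.predict([(query, chunk) for chunk in chunks])

print(cosine_scores.tolist())
print(rerank_scores.tolist())
```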
## Multi-Tenant Inference
AdaptLLM handles multiple users on shared TPU infrastructure:
```
User A Query ─┐
│ ┌──────────────┐ ┌─────────────┐
User B Query ─┼────▶│ Job Queue │────▶│ TPU Pool │
│ │ (Cloud Tasks)│ │ (Shared) │
User C Query ─┘ └──────────────┘ └─────────────┘
│
┌───────────────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ User A's Data │ │ User B's Data │ │ User C's Data │
│ (Isolated) │ │ (Isolated) │ │ (Isolated) │
└───────────────┘ └───────────────┘ └───────────────┘
```

Isolation guarantees:

- Each query only accesses that user's indexes
- No cross-tenant data leakage
- Separate billing per user
- Optional per-user adapter weights
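One way to picture the scoping is that every storage access is derived from the authenticated user's ID, so a query can only resolve paths under that tenant's `gs://adaptensor-indexes/{uid}/` prefix. A minimal sketch of that idea; the helpers below are illustrative, not Adaptensor's internal code:

```python
INDEX_BUCKET = "gs://adaptensor-indexes"

def index_prefix(uid: str, index_name: str) -> str:
    """Resolve the only storage prefix a tenant's query is allowed to read."""
    if not uid or "/" in uid:
        raise ValueError("invalid tenant id")
    return f"{INDEX_BUCKET}/{uid}/{index_name}/"

def assert_same_tenant(requesting_uid: str, target_path: str) -> None:
    """Reject any access that falls outside the requesting user's prefix."""
    if not target_path.startswith(f"{INDEX_BUCKET}/{requesting_uid}/"):
        raise PermissionError("cross-tenant access denied")

# A query from user "alice" can never resolve another tenant's data.
path = index_prefix("alice", "contracts")
assert_same_tenant("alice", path)    # passes
# assert_same_tenant("bob", path)    # would raise PermissionError
```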
## Performance Benchmarks
| Metric | Adaptensor (TPU) | GPU Alternatives |
|---|---|---|
| Embedding throughput | 1,000 chunks/min | 100 chunks/min |
| Query latency (p50) | 18ms | 50-100ms |
| Query latency (p99) | 45ms | 200-500ms |
| Cost per 1M embeddings | $0.10 | $0.50-1.00 |
| Cold start time | 30s | 60-120s |
## Configuration Options
```python
from adaptensor import AdaptensorClient

client = AdaptensorClient(api_key="sk_live_...")

# Configure inference behavior
response = client.query(
    index_name="contracts",
    query="What are the termination clauses?",
    options={
        "model": "default",       # or "fast", "accurate"
        "top_k": 10,              # Results to retrieve
        "rerank": True,           # Use cross-encoder
        "generate_answer": True,  # RAG mode
        "max_tokens": 500,        # Answer length limit
        "temperature": 0.1,       # Lower = more focused
    },
)
```
## Security & Privacy
AdaptLLM maintains strict data isolation:
- ✅ Your documents never leave your cloud storage
- ✅ No training on customer data
- ✅ No external API calls (OpenAI, Anthropic, etc.)
- ✅ SOC2, HIPAA, GDPR compliance ready
- ✅ Full audit logging
## Next Steps
- AdaptCore™ Middleware - How we make TPUs work with dynamic models
- AdaptHex™ Compression - Vector compression technology
- Architecture Overview - Full system design