# AdaptLLM™ Inference
AdaptLLM™ is Adaptensor's TPU-accelerated large language model inference engine. It powers document understanding, semantic search, and natural language queries across your private data.
## Why TPUs for LLM Inference?
| Factor | NVIDIA GPUs | Google TPUs |
|---|---|---|
| Raw FLOPS (BF16) | 312 TFLOPS (H100) | 275 TFLOPS (v5p) |
| Cost per hour | $8-12 | $4-6 |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s |
| Availability | Constrained | Good |
| Cost efficiency | Baseline | 2-3x better |
TPUs excel at the dense matrix multiplications that dominate transformer inference.
## AdaptLLM Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ AdaptLLM™ Engine │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Embedding │ │ Retrieval │ │ Generation │ │
│ │ Model │ │ Ranker │ │ (LLM) │ │
│ │ (384-dim) │ │ (Cross- │ │ (7B-70B) │ │
│ │ │ │ encoder) │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ │ │
│ AdaptCore™ │
│ (Bucketing, Adapters, Early Exit) │
│ │ │
│ ┌─────┴─────┐ │
│ │ TPU v2-8 │ │
│ │ 180 TFLOPS│ │
│ └───────────┘ │
└─────────────────────────────────────────────────────────────┘
```
## Processing Pipeline
### 1. Document Ingestion

When you upload documents, AdaptLLM runs them through the following pipeline:
```
Document → Chunking → Embedding → AdaptHex™ → Storage
│ │ │ │ │
│ │ │ │ └─ gs://adaptensor-indexes/{uid}/
│ │ │ └─ 4-8x compression
│ │ └─ 384-dim vectors on TPU
│ └─ Smart paragraph/section splitting
└─ PDF, DOCX, TXT, HTML, MD
```

Performance:

- 10x faster than GPU-based embedding
- 762 chunks indexed in 170 seconds (our aviation test)
- ~4.5 chunks/second sustained throughput
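Client-side, an upload is a single call; the engine handles the chunking, embedding, compression, and storage steps above. A minimal sketch, assuming a hypothetical `upload_document` method on the same `AdaptensorClient` shown under Configuration Options (the method name, arguments, and return value are illustrative, not the confirmed API):

```python
from adaptensor import AdaptensorClient

client = AdaptensorClient(api_key="sk_live_...")

# Hypothetical ingestion call: the engine performs chunking, 384-dim TPU
# embedding, AdaptHex compression (4-8x), then stores the index under
# gs://adaptensor-indexes/{uid}/.
job = client.upload_document(
    index_name="contracts",
    file_path="master_services_agreement.pdf",  # PDF, DOCX, TXT, HTML, MD supported
)

print(job)  # e.g. number of chunks indexed and job status
```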
### 2. Semantic Search
When you query:
```
Query → Embed → Search → Rank → Return
│ │ │ │ │
│ │ │ │ └─ Top-k results with scores
│ │ │ └─ Cross-encoder reranking
│ │ └─ Cosine similarity on AdaptHex vectors
│ └─ Same embedding model as indexing
└─ Natural language question
```

Performance:

- 18ms average query latency
- Sub-second even on million-document indexes
- Accuracy: 99.6% vs uncompressed (AdaptHex)
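The same options documented under Configuration Options can drive a search-only call. A minimal sketch, assuming that setting `generate_answer` to `False` returns the ranked chunks without invoking the generator (that behavior is inferred from the option names, not confirmed), and that the response shape includes per-chunk scores:

```python
from adaptensor import AdaptensorClient

client = AdaptensorClient(api_key="sk_live_...")

# Search-only query: embed the question, cosine-search the AdaptHex vectors,
# then rerank the candidates with the cross-encoder.
results = client.query(
    index_name="contracts",
    query="notice period for early termination",
    options={
        "top_k": 10,               # candidates to retrieve
        "rerank": True,            # cross-encoder precision ranking
        "generate_answer": False,  # assumed: skip the RAG generation step
    },
)

print(results)  # top-k chunks with similarity scores
```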
### 3. RAG (Retrieval-Augmented Generation)
For question-answering over your documents:
```
Query → Retrieve (top 5) → Construct Prompt → Generate → Answer
│
▼
┌─────────────────────────────┐
│ System: You are a helpful │
│ assistant. Use ONLY the │
│ provided context. │
│ │
│ Context: │
│ [Retrieved chunks...] │
│ │
│ Question: {user query} │
└─────────────────────────────┘
```
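The prompt-assembly step in the diagram is straightforward to reproduce. A minimal sketch, assuming the retrieved chunks arrive as plain strings; the exact template wording inside AdaptLLM may differ, and the sample chunks are illustrative:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the grounded prompt shown above: system rule, retrieved context, user question."""
    context = "\n\n".join(chunks)
    return (
        "System: You are a helpful assistant. Use ONLY the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# The top-5 retrieved chunks feed the generator (Llama-3-8B in the model stack below).
prompt = build_rag_prompt(
    "What are the termination clauses?",
    [
        "Either party may terminate this agreement with 30 days written notice...",
        "Termination for cause requires a 10-day cure period...",
    ],
)
print(prompt)
```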
## Model Stack
| Component | Model | Parameters | Purpose |
|---|---|---|---|
| Embedder | all-MiniLM-L6-v2 | 22M | Fast semantic vectors |
| Reranker | cross-encoder/ms-marco | 110M | Precision ranking |
| Generator | Llama-3-8B (quantized) | 8B | Answer generation |
All models run on TPU via AdaptCore™ middleware.
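For reference, the embed-and-rerank half of this stack can be exercised with the open-source `sentence-transformers` package on CPU/GPU; inside AdaptLLM the same models run on TPU through AdaptCore™, and the Llama-3-8B generation stage is omitted here for brevity. The specific ms-marco cross-encoder checkpoint below is an assumption, since the table names only the model family:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # 384-dim bi-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed ms-marco variant

chunks = [
    "The agreement may be terminated with 30 days written notice.",
    "Payment is due within 45 days of invoice receipt.",
]
query = "How can the contract be terminated?"

# Stage 1: bi-encoder retrieval by cosine similarity
chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)
query_vec = embedder.encode(query, convert_to_tensor=True)
cosine_scores = util.cos_sim(query_vec, chunk_vecs)[0]

# Stage 2: cross-encoder reranking of the retrieved candidates
rerank_scores = reranker.predict([(query, chunk) for chunk in chunks])

print(cosine_scores.tolist())
print(rerank_scores.tolist())
```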
## Multi-Tenant Inference
AdaptLLM handles multiple users on shared TPU infrastructure:
```
User A Query ─┐
│ ┌──────────────┐ ┌─────────────┐
User B Query ─┼────▶│ Job Queue │────▶│ TPU Pool │
│ │ (Cloud Tasks)│ │ (Shared) │
User C Query ─┘ └──────────────┘ └─────────────┘
│
┌───────────────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ User A's Data │ │ User B's Data │ │ User C's Data │
│ (Isolated) │ │ (Isolated) │ │ (Isolated) │
└───────────────┘ └───────────────┘ └───────────────┘
```

Isolation guarantees:

- Each query only accesses that user's indexes
- No cross-tenant data leakage
- Separate billing per user
- Optional per-user adapter weights
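One way to picture the scoping is that every storage access is derived from the authenticated user's ID, so a query can only resolve paths under that tenant's `gs://adaptensor-indexes/{uid}/` prefix. A minimal sketch of that idea; the helpers below are illustrative, not Adaptensor's internal code:

```python
INDEX_BUCKET = "gs://adaptensor-indexes"

def index_prefix(uid: str, index_name: str) -> str:
    """Resolve the only storage prefix a tenant's query is allowed to read."""
    if not uid or "/" in uid:
        raise ValueError("invalid tenant id")
    return f"{INDEX_BUCKET}/{uid}/{index_name}/"

def assert_same_tenant(requesting_uid: str, target_path: str) -> None:
    """Reject any access that falls outside the requesting user's prefix."""
    if not target_path.startswith(f"{INDEX_BUCKET}/{requesting_uid}/"):
        raise PermissionError("cross-tenant access denied")

# A query from user "alice" can never resolve another tenant's data.
path = index_prefix("alice", "contracts")
assert_same_tenant("alice", path)    # passes
# assert_same_tenant("bob", path)    # would raise PermissionError
```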
## Performance Benchmarks
| Metric | Adaptensor (TPU) | GPU Alternatives |
|---|---|---|
| Embedding throughput | 1,000 chunks/min | 100 chunks/min |
| Query latency (p50) | 18ms | 50-100ms |
| Query latency (p99) | 45ms | 200-500ms |
| Cost per 1M embeddings | $0.10 | $0.50-1.00 |
| Cold start time | 30s | 60-120s |
## Configuration Options
```python
from adaptensor import AdaptensorClient

client = AdaptensorClient(api_key="sk_live_...")

# Configure inference behavior
response = client.query(
    index_name="contracts",
    query="What are the termination clauses?",
    options={
        "model": "default",       # or "fast", "accurate"
        "top_k": 10,              # Results to retrieve
        "rerank": True,           # Use cross-encoder
        "generate_answer": True,  # RAG mode
        "max_tokens": 500,        # Answer length limit
        "temperature": 0.1,       # Lower = more focused
    },
)
```
## Security & Privacy
AdaptLLM maintains strict data isolation:
- ✅ Your documents never leave your cloud storage
- ✅ No training on customer data
- ✅ No external API calls (OpenAI, Anthropic, etc.)
- ✅ SOC2, HIPAA, GDPR compliance ready
- ✅ Full audit logging
## Next Steps
- AdaptCore™ Middleware - How we make TPUs work with dynamic models
- AdaptHex™ Compression - Vector compression technology
- Architecture Overview - Full system design