AdaptCore™ Middleware¶
AdaptCore™ is Adaptensor's proprietary TPU middleware that lets PyTorch and JAX models with dynamic shapes run efficiently on Google's Tensor Processing Units (TPUs).
The Problem: Static Shape Barrier¶
TPUs require static tensor shapes for optimal performance. The XLA compiler needs to know exact dimensions at compile time. But most real-world AI workloads are dynamic:
- Variable-length text (tweets vs. legal documents)
- Changing batch sizes (traffic spikes, streaming data)
- Conditional execution (early stopping, branching logic)
When shapes change, XLA must recompile—causing latency spikes and wasted compute.
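For example, under JAX every previously unseen input shape triggers a fresh trace and XLA compile. A toy function, just to show where the stalls come from:
import jax
import jax.numpy as jnp

@jax.jit
def score(x):
    return jnp.tanh(x).sum()

score(jnp.ones((8, 128)))   # first call: traces and XLA-compiles for shape (8, 128)
score(jnp.ones((8, 131)))   # new shape: recompiles, another latency spike
score(jnp.ones((8, 128)))   # shape already seen: reuses the cached executable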
The Solution: AdaptCore™¶
AdaptCore inserts a translation layer between your model and the TPU:
┌─────────────────┐
│ Your Model │ (Dynamic shapes, any framework)
│ PyTorch / JAX │
└────────┬────────┘
│
▼
┌─────────────────┐
│ AdaptCore™ │ ← Shape normalization
│ Middleware │ ← Adapter injection
│ │ ← Early exit logic
└────────┬────────┘
│
▼
┌─────────────────┐
│ Google TPU │ (Static shapes, XLA-compiled)
│ v2/v4/v5 │
└─────────────────┘
Three Pillars of AdaptCore™¶
1. Adaptive Bucketing & Padding¶
Instead of sending variable-length inputs directly to the TPU, AdaptCore:
- Inspects incoming batch shapes
- Assigns each batch to a bucket (e.g., 128, 256, 512, 1024, or 2048 tokens)
- Pads inputs to the bucket size
- Masks padded positions in attention and loss computation
# Example: Input of 347 tokens → Bucket 512
# XLA compiles once for 512, reuses it for all 257-512 token inputs
Result: XLA compiles 3-4 variants instead of thousands. Throughput stays consistent.
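A minimal sketch of the bucketing and padding step, assuming the bucket sizes above (the helper names are illustrative, not the AdaptCore API):
import jax.numpy as jnp

BUCKETS = (128, 256, 512, 1024, 2048)

def pad_to_bucket(token_ids):
    """Pad a 1-D token sequence up to the smallest bucket that fits it."""
    length = token_ids.shape[0]
    bucket = next(b for b in BUCKETS if b >= length)               # 347 tokens -> bucket 512
    padded = jnp.zeros(bucket, dtype=token_ids.dtype).at[:length].set(token_ids)
    mask = jnp.arange(bucket) < length                             # True for real tokens, False for padding
    return padded, mask                                            # the mask feeds attention/loss masking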
2. Tensor Adapters (LoRA Hot-Swap)¶
Large models (7B+ parameters) are expensive to reload. AdaptCore keeps a frozen backbone in TPU memory and injects small adapter matrices:
output = (W @ x) + (A @ B @ x) * scale
Where:
- W = frozen backbone weight (7B params, never changes)
- A, B = adapter matrices (1-10M params per user/task)
- scale = normalization factor
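In code, the injection is one extra low-rank matmul on top of the frozen path. A sketch with illustrative shapes (hidden size d, adapter rank r; the 1/r scale is a placeholder for the normalization factor):
import jax.numpy as jnp

d, r = 4096, 16                            # hidden size, adapter rank (8-64 in practice)
W = jnp.zeros((d, d))                      # frozen backbone weight, stays resident on the TPU
A = jnp.zeros((d, r))                      # per-user/task adapter matrices (placeholder values)
B = jnp.zeros((r, d))
scale = 1.0 / r

def adapted_linear(x):
    return (W @ x) + (A @ (B @ x)) * scale   # same formula as above, low-rank path evaluated first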
Benefits:
| Metric | Without Adapters | With AdaptCore™ |
|---|---|---|
| Model switch time | 30-60 seconds | < 100ms |
| Memory per user | 14GB+ | ~50MB |
| Users per TPU | 1 | 100+ |
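Because only the small (A, B) pair changes between users or tasks, switching is a lookup rather than a model reload. A sketch reusing the names above; the adapter registry is hypothetical:
adapters = {
    "legal-v2":   (jnp.zeros((d, r)), jnp.zeros((r, d))),   # ~1-10M params per pair
    "medical-v1": (jnp.zeros((d, r)), jnp.zeros((r, d))),
}

def adapted_linear_for(x, task):
    A, B = adapters[task]                  # hot-swap: a dict lookup, not a 14GB+ backbone reload
    return (W @ x) + (A @ (B @ x)) * scale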
3. Entropy-Based Early Exit¶
Not every query needs all model layers. AdaptCore measures prediction confidence after each layer:
import jax.numpy as jnp
from jax.nn import softmax

def entropy(logits):
    probs = softmax(logits)
    return -jnp.sum(probs * jnp.log(probs))

# Low entropy = high confidence = exit early
# High entropy = uncertain = continue to next layer
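For intuition, with 4-class toy logits a peaked distribution gives near-zero entropy and a uniform one gives the maximum, ln 4:
confident = jnp.array([10.0, 0.0, 0.0, 0.0])   # model is sure of class 0
uncertain = jnp.array([1.0, 1.0, 1.0, 1.0])    # model has no idea

entropy(confident)   # ≈ 0.0015 nats -> exit early
entropy(uncertain)   # ≈ 1.386 nats (= ln 4) -> keep going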
XLA-Compatible Implementation:
# Fixed loop bound + numerical masking: every layer runs, so the graph stays static for XLA
final_output, finished = x, jnp.array(False)
for i in range(MAX_LAYERS):                     # MAX_LAYERS is a Python int: depth is fixed at trace time
    x = layers[i](x)
    confidence = 1.0 - entropy(x) / MAX_ENTROPY
    exit_now = (confidence > threshold) & ~finished
    final_output = jnp.where(exit_now, x, final_output)   # freeze the output at the exit layer
    finished = finished | exit_now
This gives dynamic depth behavior while maintaining static graph structure for XLA.
Performance Impact:
| Query Type | Layers Used | Speedup |
|---|---|---|
| "What is 2+2?" | 3/32 | 10x |
| "Explain quantum entanglement" | 28/32 | 1.1x |
| Average workload | 12/32 | 2.7x |
Technical Specifications¶
| Specification | Value |
|---|---|
| Supported TPUs | v2-8, v4-8, v5p (pods) |
| Supported Frameworks | JAX, PyTorch (via PyTorch/XLA) |
| Bucket Sizes | 128, 256, 512, 1024, 2048 |
| Adapter Rank | 8-64 (configurable) |
| Early Exit Threshold | 0.7-0.95 (configurable) |
| Compilation Overhead | ~30s first request, <1ms subsequent |
Using AdaptCore™¶
AdaptCore is automatically used when you interact with Adaptensor's APIs. You don't need to configure anything—just send your documents and queries.
For advanced users who want direct TPU access:
from adaptensor import AdaptensorClient
client = AdaptensorClient(api_key="sk_live_...")
# AdaptCore handles bucketing, adapters, and early exit automatically
results = client.query(
    index_name="my-documents",
    query="What are the key findings?",
    options={
        "early_exit_threshold": 0.85,  # Optional tuning
        "adapter": "legal-v2"          # Optional custom adapter
    }
)
Comparison: With vs Without AdaptCore™¶
| Metric | Raw TPU | With AdaptCore™ |
|---|---|---|
| Dynamic shape support | ❌ Manual bucketing | ✅ Automatic |
| Multi-tenant | ❌ One user per TPU | ✅ 100+ users |
| Framework support | TensorFlow only | PyTorch, JAX, TF |
| Compilation stalls | Frequent | Rare |
| Cost efficiency | Baseline | 2-3x better |
Next Steps¶
- AdaptHex™ Compression - How we reduce storage costs
- AdaptLLM™ Inference - TPU-accelerated LLM processing
- Architecture Overview - Full system design