137 Particles Labs Observatory
We test on real hardware: Apple M4 Max and Metal Mac minis, NVIDIA RTX 4090 and RTX 3090, AMD RX 7900 XTX, Ryzen 9 9950X, Ryzen AI HX 370, and Xeon Scalable.
Synthetic benchmarks are useless. We measure tokens-per-second, memory bandwidth saturation, and quantization loss on the hardware you actually own.
- Embedding Efficiency: 98.4% (MiniLM 384-dim vs. Large 1024-dim)
- Quantization Sweet Spot: Q4_K_M (best speed/perplexity ratio)
- Models Cataloged: 1,240 (across 50+ architectures)
Dataset A
Small Vector vs. Large Vector
| Model | Dimensions | Size (RAM) | Retrieval Score (MTEB) | Throughput (Docs/Sec) |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80 MB | 58.4 | 14,000 |
| bge-large-en-v1.5 | 1024 | 1.34 GB | 64.2 | 850 |
| text-embedding-3-large | 3072 | N/A (API) | 64.6 | Latency Bound |
Conclusion: Unless you are doing hyperspecialized legal or medical retrieval, the 384-dim model offers roughly 16x the throughput for a 9% retrieval penalty. For local agents, speed is context.
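The tradeoff can be sanity-checked directly from the Dataset A rows above. A minimal sketch; every number is copied from the table, none are new measurements:

```python
# Speed/quality tradeoff, Dataset A (figures copied from the table above).
minilm = {"dims": 384, "mteb": 58.4, "docs_per_sec": 14_000}
bge_large = {"dims": 1024, "mteb": 64.2, "docs_per_sec": 850}

# How many documents MiniLM embeds for each one bge-large embeds.
throughput_gain = minilm["docs_per_sec"] / bge_large["docs_per_sec"]

# Relative drop in MTEB retrieval score when choosing the small model.
retrieval_penalty = (bge_large["mteb"] - minilm["mteb"]) / bge_large["mteb"]

print(f"Throughput gain: {throughput_gain:.1f}x")     # ~16.5x
print(f"Retrieval penalty: {retrieval_penalty:.1%}")  # ~9.0%
```

The same two-line calculation works for any pair of rows, so you can plug in your own candidate models before committing to an index rebuild.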
Dataset B
The Quantization Tax
| Quant Level | Model Size | Perplexity (Lower is Better) | Tokens/Sec (RTX 3090) | VRAM Bus Load |
|---|---|---|---|---|
| FP16 (Uncompressed) | 13.6 GB | 5.42 | 45 t/s | 98% (Bottleneck) |
| Q4_K_M (Recommended) | 4.2 GB | 5.51 | 110 t/s | 45% (Optimal) |
| Q2_K | 2.8 GB | 6.80 | 115 t/s | 30% |
Conclusion: Dropping from FP16 to Q4_K_M yields a 2.4x speed increase with negligible (<0.1) perplexity loss. Q2 degrades reasoning significantly for minimal speed gain.
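The same arithmetic applies to Dataset B. A minimal sketch using only the figures from the table above, which makes the diminishing returns of Q2_K explicit:

```python
# Quantization tax, Dataset B (figures copied from the table above).
fp16 = {"size_gb": 13.6, "ppl": 5.42, "tok_s": 45}
q4_k_m = {"size_gb": 4.2, "ppl": 5.51, "tok_s": 110}
q2_k = {"size_gb": 2.8, "ppl": 6.80, "tok_s": 115}

# Q4_K_M vs. FP16: large speedup, tiny perplexity cost.
speedup = q4_k_m["tok_s"] / fp16["tok_s"]   # ~2.4x
ppl_loss = q4_k_m["ppl"] - fp16["ppl"]      # ~0.09

# Q2_K vs. Q4_K_M: almost no extra speed for a big perplexity jump.
q2_gain = q2_k["tok_s"] / q4_k_m["tok_s"]   # ~1.05x
q2_cost = q2_k["ppl"] - q4_k_m["ppl"]       # ~1.29

print(f"Q4_K_M: {speedup:.1f}x faster, +{ppl_loss:.2f} perplexity")
print(f"Q2_K:   {q2_gain:.2f}x faster than Q4, +{q2_cost:.2f} perplexity")
```

Note the mechanism: once VRAM bus load drops below saturation (45% at Q4_K_M), further shrinking the weights no longer buys tokens-per-second, only perplexity.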
Contribute your Hardware Data
Our benchmarking tool is open source. Download the CLI, run the standard suite on your specific rig, and push the results to our public repository.