137 Particles Labs Observatory

We test on: Metal Mac minis, NVIDIA RTX 4090, AMD RX 7900 XTX, Ryzen 9 9950X, Ryzen AI HX 370, Apple M4 Max, Xeon Scalable, NVIDIA RTX 3090

Synthetic benchmarks are useless. We measure tokens-per-second, memory bandwidth saturation, and quantization loss on the hardware you actually own.

Embedding Efficiency: 98.4% (MiniLM, 384 dims, vs. Large, 1024 dims)
Quantization Sweet Spot: Q4_K_M (best speed/perplexity ratio)
Models Cataloged: 1,240 (across 50+ architectures)
Dataset A

Small Vector vs. Large Vector

| Model | Dimensions | Size (RAM) | Retrieval Score (MTEB) | Throughput (Docs/Sec) |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | 80 MB | 58.4 | 14,000 |
| bge-large-en-v1.5 | 1024 | 1.34 GB | 64.2 | 850 |
| openai-text-3-large | 3072 | N/A (API) | 64.6 | Latency bound |
Conclusion: Unless you are doing hyper-specialized legal/medical retrieval, the 384-dim model offers ~16x the throughput for a ~9% retrieval penalty. For local agents, speed is context.
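The throughput and penalty figures in that conclusion fall straight out of the table. A quick sanity sketch (numbers copied from Dataset A):

```python
# Sanity-check the Dataset A tradeoff claim from the raw table figures.
minilm = {"score": 58.4, "docs_per_sec": 14_000}  # all-MiniLM-L6-v2 (384-dim)
bge    = {"score": 64.2, "docs_per_sec": 850}     # bge-large-en-v1.5 (1024-dim)

throughput_gain = minilm["docs_per_sec"] / bge["docs_per_sec"]
retrieval_penalty = (bge["score"] - minilm["score"]) / bge["score"]

print(f"Throughput gain: {throughput_gain:.1f}x")     # ~16.5x
print(f"Retrieval penalty: {retrieval_penalty:.1%}")  # ~9.0%
```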
Dataset B

The Quantization Tax

| Quant Level | Model Size | Perplexity (lower is better) | Tokens/Sec (RTX 3090) | VRAM Bus Load |
| --- | --- | --- | --- | --- |
| FP16 (uncompressed) | 13.6 GB | 5.42 | 45 t/s | 98% (bottleneck) |
| Q4_K_M (recommended) | 4.2 GB | 5.51 | 110 t/s | 45% (optimal) |
| Q2_K | 2.8 GB | 6.80 | 115 t/s | 30% |
Conclusion: Dropping from FP16 to Q4_K_M yields a 2.4x speed increase with negligible (<0.1) perplexity loss. Q2 degrades reasoning significantly for minimal speed gain.
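The quoted speedup and perplexity delta can be derived from the Dataset B table, along with the effective bits-per-weight of the Q4_K_M file (a rough estimate that treats GB as 10^9 bytes and FP16 as exactly 2 bytes per weight):

```python
# Derive the "quantization tax" metrics from the Dataset B table.
fp16 = {"size_gb": 13.6, "ppl": 5.42, "tok_s": 45}
q4km = {"size_gb": 4.2,  "ppl": 5.51, "tok_s": 110}

speedup  = q4km["tok_s"] / fp16["tok_s"]   # ~2.4x
ppl_loss = q4km["ppl"] - fp16["ppl"]       # +0.09 (< 0.1)

# FP16 stores 2 bytes per weight, so the parameter count falls out of the
# uncompressed size: ~6.8B weights (a 7B-class model).
params = fp16["size_gb"] * 1e9 / 2
bits_per_weight = q4km["size_gb"] * 1e9 * 8 / params  # ~4.9 effective bits

print(f"{speedup:.1f}x faster, +{ppl_loss:.2f} perplexity, "
      f"~{bits_per_weight:.1f} bits/weight")
```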

Contribute your Hardware Data

Our benchmarking tool is open source. Download the CLI, run the standard suite on your specific rig, and push the results to our public repository.
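The actual suite lives in the repo; at its core, the tokens-per-second measurement reduces to a timing harness like the sketch below. This is an illustrative stand-in, not the real CLI: `tokens_per_second` and `fake_generate` are hypothetical names, and `generate` is a placeholder for whatever backend you benchmark (llama.cpp bindings, MLX, etc.).

```python
import time

def tokens_per_second(generate, prompt, n_tokens=128):
    """Time a generation callable and report throughput.

    `generate(prompt, n_tokens)` must return the number of tokens
    it actually produced.
    """
    start = time.perf_counter()
    produced = generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed

# Dummy backend so the harness runs standalone; swap in a real model call.
def fake_generate(prompt, n_tokens):
    return n_tokens

print(f"{tokens_per_second(fake_generate, 'warmup prompt'):.0f} t/s")
```

Run it once as a warmup before recording numbers: first-call overhead (weight loading, kernel compilation) will otherwise skew the result.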

View Repo