137 Particles Labs Observatory

We test on: Metal Mac minis, NVIDIA RTX 4090, AMD RX 7900 XTX, Ryzen 9 9950X, Ryzen AI HX 370, Apple M4 Max, Xeon Scalable, NVIDIA RTX 3090

Synthetic benchmarks are useless. We measure tokens-per-second, memory bandwidth saturation, and quantization loss on the hardware you actually own.

Embedding Efficiency: 98.4% (MiniLM, 384 dims, vs. Large, 1024 dims)
Quantization Sweet Spot: Q4_K_M (best speed/perplexity ratio)
Models Cataloged: 1,240 (across 50+ architectures)
Dataset A

Small Vector vs. Large Vector

| Model | Dimensions | Size (RAM) | Retrieval Score (MTEB) | Throughput (Docs/Sec) |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | 80 MB | 58.4 | 14,000 |
| bge-large-en-v1.5 | 1024 | 1.34 GB | 64.2 | 850 |
| openai-text-3-large | 3072 | N/A (API) | 64.6 | Latency bound |
Conclusion: Unless you are doing hyper-specialized legal/medical retrieval, the 384-dim model offers ~16x the throughput for a ~9% retrieval penalty. For local agents, speed is context.
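The throughput and penalty figures in that conclusion fall straight out of the table. A quick sanity sketch (numbers copied from Dataset A):

```python
# Sanity-check the Dataset A tradeoff claim from the raw table figures.
minilm = {"score": 58.4, "docs_per_sec": 14_000}  # all-MiniLM-L6-v2 (384-dim)
bge    = {"score": 64.2, "docs_per_sec": 850}     # bge-large-en-v1.5 (1024-dim)

throughput_gain = minilm["docs_per_sec"] / bge["docs_per_sec"]
retrieval_penalty = (bge["score"] - minilm["score"]) / bge["score"]

print(f"Throughput gain: {throughput_gain:.1f}x")     # ~16.5x
print(f"Retrieval penalty: {retrieval_penalty:.1%}")  # ~9.0%
```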
Dataset B

The Quantization Tax

| Quant Level | Model Size | Perplexity (lower is better) | Tokens/Sec (RTX 3090) | VRAM Bus Load |
| --- | --- | --- | --- | --- |
| FP16 (uncompressed) | 13.6 GB | 5.42 | 45 t/s | 98% (bottleneck) |
| Q4_K_M (recommended) | 4.2 GB | 5.51 | 110 t/s | 45% (optimal) |
| Q2_K | 2.8 GB | 6.80 | 115 t/s | 30% |
Conclusion: Dropping from FP16 to Q4_K_M yields a 2.4x speed increase with negligible (<0.1) perplexity loss. Q2 degrades reasoning significantly for minimal speed gain.
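The quoted speedup and perplexity delta can be derived from the Dataset B table, along with the effective bits-per-weight of the Q4_K_M file (a rough estimate that treats GB as 10^9 bytes and FP16 as exactly 2 bytes per weight):

```python
# Derive the "quantization tax" metrics from the Dataset B table.
fp16 = {"size_gb": 13.6, "ppl": 5.42, "tok_s": 45}
q4km = {"size_gb": 4.2,  "ppl": 5.51, "tok_s": 110}

speedup  = q4km["tok_s"] / fp16["tok_s"]   # ~2.4x
ppl_loss = q4km["ppl"] - fp16["ppl"]       # +0.09 (< 0.1)

# FP16 stores 2 bytes per weight, so the parameter count falls out of the
# uncompressed size: ~6.8B weights (a 7B-class model).
params = fp16["size_gb"] * 1e9 / 2
bits_per_weight = q4km["size_gb"] * 1e9 * 8 / params  # ~4.9 effective bits

print(f"{speedup:.1f}x faster, +{ppl_loss:.2f} perplexity, "
      f"~{bits_per_weight:.1f} bits/weight")
```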

Contribute your Hardware Data

Our benchmarking tool is open source. Download the CLI, run the standard suite on your specific rig, and push the results to our public repository.
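The actual suite lives in the repo; at its core, the tokens-per-second measurement reduces to a timing harness like the sketch below. This is an illustrative stand-in, not the real CLI: `tokens_per_second` and `fake_generate` are hypothetical names, and `generate` is a placeholder for whatever backend you benchmark (llama.cpp bindings, MLX, etc.).

```python
import time

def tokens_per_second(generate, prompt, n_tokens=128):
    """Time a generation callable and report throughput.

    `generate(prompt, n_tokens)` must return the number of tokens
    it actually produced.
    """
    start = time.perf_counter()
    produced = generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed

# Dummy backend so the harness runs standalone; swap in a real model call.
def fake_generate(prompt, n_tokens):
    return n_tokens

print(f"{tokens_per_second(fake_generate, 'warmup prompt'):.0f} t/s")
```

Run it once as a warmup before recording numbers: first-call overhead (weight loading, kernel compilation) will otherwise skew the result.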

View Repo