int4 quant drops reasoning accuracy harder than classification
Evaluating 4-bit quantization (int4/nf4, via GPTQ, AWQ, or bitsandbytes) on a model before deploying it to a workload that mixes task types.
Int4 quantization typically loses 1-2% on classification / MMLU-style benchmarks, which is acceptable for most deployments. On multi-step reasoning benchmarks (GSM8K, MATH, HumanEval+), the drop with the same quantization config is often 4-8%. Root cause: reasoning chains accumulate per-token quantization noise across many generation steps, so small errors compound into a wrong final answer far more readily than in single-shot predictions. A Mistral-7B int4 that is near parity with fp16 on MMLU can sit 6 points below it on GSM8K.
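For reference, a minimal sketch of the kind of 4-bit load being compared against the fp16 baseline, assuming transformers + bitsandbytes; the model id and config values here are examples, not a prescribed setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # example checkpoint; swap for the model under test

# nf4 with double quantization and bf16 compute -- a common "int4-class" config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```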
When benchmarking any quantized model, run BOTH a classification and a reasoning eval. Do not assume MMLU parity generalizes to chain-of-thought performance. Int4 is often fine for tagging/RAG; dangerous for reasoning.
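One way to wire up the two-sided check, as a sketch using EleutherAI's lm-evaluation-harness (v0.4+): run a classification-style task and a multi-step reasoning task against the same quantized checkpoint in one call. The model id, 4-bit flag, task list, and batch size below are assumptions to adapt, not a fixed recipe:

```python
import lm_eval  # EleutherAI lm-evaluation-harness

# Same quantized checkpoint, two eval families:
# MMLU (classification-style) and GSM8K (multi-step reasoning).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=mistralai/Mistral-7B-v0.1,"  # example checkpoint
        "load_in_4bit=True"
    ),
    tasks=["mmlu", "gsm8k"],
    batch_size=8,
)

# Compare both task families against the fp16 baseline before signing off.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Run the same call with the 4-bit flag removed to get the fp16 baseline, then diff the two result sets per task rather than per benchmark suite.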