temperature=0 is not deterministic on gpt-5.x
Observed while running the same prompt with temperature=0 on a recent OpenAI reasoning-capable model several times against identical input: the output drifts from call to call.
With the gpt-4o family, temperature=0 was effectively deterministic: the same prompt and the same input at temp=0 reliably produced the same output across calls. With gpt-5.x reasoning-capable models that property does not hold. Identical inputs at temp=0 produce meaningfully different outputs across calls, because the internal reasoning path is itself sampled even when the final-token sampling temperature is pinned. A specific failure mode you saw once may not reproduce on the next call, which makes regression-style "the model used to do X here" debugging unreliable. Two practical consequences: (1) prompt sweeps need multiple runs per prompt to characterise behaviour, because a single call per variation gives misleadingly clean comparisons; (2) load-bearing safety should live in post-processing (confidence-floor filters, downstream validators) rather than in the prompt rules, which are doing less than you think.
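As a rough illustration of consequence (2), the sketch below enforces a confidence floor and a simple validator in code, so the check holds no matter which reasoning path the model took. Everything in it is an assumption made for the example: the Finding shape, the 0.7 floor, and the allowed label set are hypothetical and not taken from any real pipeline.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; tune them on your own data.
CONFIDENCE_FLOOR = 0.7
ALLOWED_LABELS = {"approve", "reject", "escalate"}


@dataclass
class Finding:
    """One parsed item from a model response (assumed shape)."""
    label: str
    confidence: float


def post_filter(findings, floor=CONFIDENCE_FLOOR, allowed=ALLOWED_LABELS):
    """Keep only findings that pass checks enforced in code, not in the prompt."""
    kept = []
    for f in findings:
        if f.label not in allowed:
            continue  # validator: drop labels outside the allowed set
        if not (0.0 <= f.confidence <= 1.0):
            continue  # validator: drop malformed confidence values
        if f.confidence < floor:
            continue  # confidence floor: drop low-confidence findings
        kept.append(f)
    return kept
```

Because the filter runs on every response, it behaves the same on a call where the model followed the prompt rules and on one where it did not.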
When picking a prompt for production on a gpt-5.x or other reasoning-capable model, run each candidate prompt at least 5 times on the same input and compare the variance across runs, rather than judging each candidate from a single call. Put the safety net in confidence floors and post-filters, not in the prompt itself.
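A minimal sketch of such a repeated-run sweep, under stated assumptions: call_model is a hypothetical callable you supply (prompt and input in, output string out) that wraps whatever client you use, and the default score (output length) is only a placeholder for a task-specific metric.

```python
from collections import Counter
from statistics import pvariance


def characterise(call_model, prompt, input_text, runs=5, score=len):
    """Call the model `runs` times on identical input and summarise the spread."""
    outputs = [call_model(prompt, input_text) for _ in range(runs)]
    counts = Counter(outputs)
    top_count = counts.most_common(1)[0][1]  # size of the largest group of identical outputs
    scores = [float(score(o)) for o in outputs]
    return {
        "distinct_outputs": len(counts),
        "agreement_rate": top_count / runs,  # 1.0 means every run matched exactly
        "score_variance": pvariance(scores),
    }


def sweep(call_model, prompts, input_text, runs=5, score=len):
    """Characterise every candidate prompt; `prompts` maps a name to prompt text."""
    return {
        name: characterise(call_model, prompt, input_text, runs, score)
        for name, prompt in prompts.items()
    }
```

Comparing candidates on agreement_rate and score_variance, rather than on one output each, makes it visible when an apparently better prompt was just a lucky draw.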