It’s 9:03 a.m. on a Tuesday. A caregiver opens your gen-AI assistant and asks, “Do we qualify for copay support?” The assistant replies, “You likely qualify. Upload your documents to get started.” At 4:17 p.m., a different caregiver asks the same question and sees, “You may qualify. Please call this number to confirm.”
No one changed the phrasing. Yet the words moved.
That tiny wobble, from “likely” to “may” and from “upload” to “call,” is nondeterminism in action. In the pharma world, those shifts snowball into PRC follow-ups, confused callers and A/B tests you don’t quite trust.
A new memo from Thinking Machines puts a name to this phenomenon. It argues that the culprit isn’t “AI being creative” so much as the way real-world servers group requests under load. Even when you dial the randomness down to zero, production systems can still produce slightly different answers.
That’s because servers change how many requests they batch together depending on traffic. In turn, subtle numerical differences creep in and, over many layers, can nudge phrasing. That’s why the answer at lunchtime may not quite match the answer at breakfast, even if you didn’t touch a thing.
Imagine the kitchen versus the recipe. The chef (the model) will follow the same recipe every time, but the restaurant (the server) decides how many orders to cook at once. On a quiet afternoon, for instance, food might be simmering in a single pot, while during the dinner rush, it might be simmering in three.
In theory, batching shouldn’t change the taste. In practice, some steps in the cooking process are sensitive to how many pots are on the stove.
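Here is a toy illustration of why the number of pots matters, written in Python. It is our sketch, not the memo’s actual kernels: floating-point addition is not associative, so summing the same numbers in different groupings can produce slightly different totals.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(4096).astype(np.float32)

# "Quiet afternoon": one pot, reduce the whole array at once.
one_pot = values.sum(dtype=np.float32)

# "Dinner rush": three pots, reduce in chunks, then combine the chunks.
three_pots = np.float32(0)
for chunk in np.array_split(values, 3):
    three_pots += chunk.sum(dtype=np.float32)

# The two totals typically differ in the last bits, purely because the
# grouping of the additions changed, not because any input changed.
print(one_pot, three_pots, one_pot == three_pots)
```

The same effect, buried inside normalization layers and matrix multiplications and repeated across dozens of layers, is what lets a busy server drift away from a quiet one.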
Why should marketers care? Start with reproducibility. If your team approved a specific phrasing for a coverage explanation last week, MLR does not want to see a mysteriously different cousin of it this week.
Then consider patient safety and access. Slight differences can flip an intent classifier or change a subsequent step. To a caregiver, “upload documents” versus “call first” is not a rounding error.
And finally, there’s the impact on brand consistency. Small shifts like the “Queens, New York” versus “New York City” example that the memo highlights turn into reconciliation tickets, mismatched screenshots and late-night email threads about which screen represents the “truth.”
Thinking Machines traces this to a lack of batch invariance: The math inside key operations (such as normalization and matrix multiplications) can produce numerically different results when the batch changes, even if each individual run is deterministic in isolation. From a user’s point of view, outputs feel nondeterministic, because system load is unpredictable. In pharma, that’s the bridge between infrastructure detail and brand-manager headache.
This is not just an opinion. Popular inference engines acknowledge it in their own documentation.
The good news is also the headline: There’s a practical path to make outputs repeatable without shutting down your server farm. Thinking Machines shows that if you change the core kernels so that the numerics behave the same way whether there’s one request or 30, then the same prompt produces the same answer every time, even under load. That is the difference between “we hope it matches PRC” and “we can reproduce the screenshot.”
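To make the idea concrete, here is a minimal sketch of what “batch-invariant” means for a single reduction. This is our illustration, not Thinking Machines’ code: the fix is to split the work the same fixed way no matter how many requests happen to be in flight.

```python
import numpy as np

rng = np.random.default_rng(0)
request = rng.standard_normal(4096).astype(np.float32)

def load_dependent_sum(x: np.ndarray, requests_in_flight: int) -> np.float32:
    # Naive scheme: how the reduction is split depends on current load,
    # a stand-in for kernels whose tiling changes with batch size.
    total = np.float32(0)
    for chunk in np.array_split(x, requests_in_flight):
        total += chunk.sum(dtype=np.float32)
    return total

def batch_invariant_sum(x: np.ndarray, chunk_size: int = 256) -> np.float32:
    # Batch-invariant scheme: always split into the same fixed-size chunks,
    # in the same order, whatever the traffic looks like.
    total = np.float32(0)
    for start in range(0, len(x), chunk_size):
        total += x[start:start + chunk_size].sum(dtype=np.float32)
    return total

quiet = load_dependent_sum(request, requests_in_flight=1)
busy = load_dependent_sum(request, requests_in_flight=30)
print(quiet == busy)  # often False: same request, different grouping

# The invariant version's grouping never depends on load, so a quiet server
# and a slammed one produce bit-identical results for the same request.
print(batch_invariant_sum(request) == batch_invariant_sum(request))  # True
```

Real inference kernels are far more involved, but the principle is the same: take the grouping decision away from traffic.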
Alternatively, teams can sidestep the challenge entirely by adopting non-generative approaches for high-stakes content. Some organizations use structured retrieval systems (what we might call an Ostro approach) where responses are assembled from pre-approved, static components.
This isn’t just about caching frequent queries; it’s about architecting the entire system around deterministic content assembly rather than generation. For pharma use cases touching labels, eligibility or safety information, this approach offers complete reproducibility by design.
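As a sketch of what that assembly can look like (the names, identifiers and text below are hypothetical and are not the actual Ostro product), every response is stitched from pre-approved, versioned components, so the same question always yields byte-identical text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ApprovedComponent:
    approval_id: str   # hypothetical ID tying the text back to its PRC/MLR record
    text: str

# Pre-approved, static building blocks (illustrative text only).
LIBRARY = {
    "copay_intro": ApprovedComponent("PRC-EX-001", "You may qualify for copay support."),
    "copay_next_step": ApprovedComponent(
        "PRC-EX-002", "Please call the number on your card to confirm eligibility."),
}

# Each recognized intent maps to a fixed, ordered list of components.
ASSEMBLY_PLAN = {
    "copay_support_eligibility": ["copay_intro", "copay_next_step"],
}

def assemble(intent: str) -> str:
    """Deterministically assemble a response from pre-approved components."""
    return " ".join(LIBRARY[key].text for key in ASSEMBLY_PLAN[intent])

# Same intent in, same approved sentence out; no generation involved.
print(assemble("copay_support_eligibility"))
```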
There is a trade-off in terms of speed. In tests, an unoptimized deterministic path roughly doubles latency versus the default, and a better attention kernel narrows the gap. But “slightly slower” is often a fair price for “defensible and auditable,” especially when the flow touches label, eligibility or safety. And because this is an engineering problem rather than a law of nature, the performance tax will likely shrink as kernels and runtimes mature. In other words: This is a tunable dial, not a cliff.
What does this mean for your programs? It suggests a need to run your AI in two clearly labeled modes. Creative work, such as subject lines, social alternates and early-stage copy exploration, can embrace the wiggle (let’s call that creative mode). Regulated experiences that demand repeatable behavior, such as label-touched content, coverage explanations, eligibility screeners and safety prompts, need stricter guardrails (call that governance mode).
In governance mode, you freeze model and decoding settings; log provenance for every response (model version, decoding parameters, server build, effective batch size); and prove reproducibility before you ship with a small golden set of prompts that are run multiple times.
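A minimal sketch of that bookkeeping, with field names that are our assumption rather than any vendor’s API, might look like this (the golden-set check takes whatever `ask` function your stack exposes):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ServingConfig:
    model_version: str      # frozen for governance-mode flows
    temperature: float
    top_p: float
    server_build: str

def config_hash(cfg: ServingConfig) -> str:
    """Stable hash of the pinned serving configuration, for provenance logs."""
    blob = json.dumps(asdict(cfg), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def provenance_record(prompt: str, answer: str, cfg: ServingConfig,
                      effective_batch_size: int) -> dict:
    """One log entry per served response, traceable back to exact settings."""
    return {
        "prompt": prompt,
        "answer": answer,
        "config_hash": config_hash(cfg),
        "effective_batch_size": effective_batch_size,
    }

def passes_golden_set(ask, golden_prompts, runs: int = 5) -> bool:
    """Reproducibility gate: every golden prompt must return byte-identical
    answers across repeated runs before the flow ships."""
    return all(len({ask(p) for _ in range(runs)}) == 1 for p in golden_prompts)
```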
If your vendor can enable a documented deterministic mode, you turn it on for these flows. If they can’t or if you want to avoid the complexity entirely, consider non-generative architectures that use retrieval and template assembly.
You can also be fussy about caching, in a helpful way. For the dozen intents that drive the bulk of patient and HCP questions, caching exact, PRC-approved answers and serving them verbatim when the prompt matches is not laziness. Rather, it’s product quality. The content still has a heartbeat: Upstream facts refresh on a schedule, your cache rebuilds on a release cadence and your analytics trace each served answer back to a model/config hash. Thus any A/B test you run is about copy, as opposed to yesterday’s server traffic.
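A sketch of that kind of verbatim cache (the structure, answer text and identifiers below are illustrative assumptions) shows how each served answer stays traceable to an approval record and a pinned model/config hash:

```python
# Rebuilt on the release cadence from upstream, PRC-approved sources.
APPROVED_CACHE = {
    "do we qualify for copay support?": {
        "answer": "You may qualify. Please call the number on your card to confirm.",
        "approval_id": "PRC-EX-002",     # hypothetical approval identifier
        "config_hash": "a1b2c3d4e5f6",   # pinned model/config hash for analytics
    },
}

def serve(prompt: str, fallback):
    """Serve the approved answer verbatim on an exact match; otherwise fall back
    to the governance- or creative-mode model call supplied by the caller."""
    hit = APPROVED_CACHE.get(prompt.strip().lower())
    if hit is not None:
        return hit["answer"]
    return fallback(prompt)
```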
If you want a simple “how would we know we’re done?” test, imagine re-running last quarter’s approved prompts two months from now. In the best-case scenario, you get the same answers in governance mode; your screenshots match PRC’s archive; your analytics are clean because every experiment pins model and serving settings; and the only surprises happen in creative mode, where, after all, surprises belong. That brand of boring is a feature, not a bug.
The larger point, one that Thinking Machines makes authoritatively, is that the randomness here isn’t mystical. It’s a few concrete implementation choices leaking into user experience through batching. Force those choices to behave the same way every time, and the mystery vanishes. What remains is an explicit trade curve between speed and reproducibility that is tuned on a use-case-by-use-case basis, instead of a roulette wheel that spins whenever your traffic spikes.
Ultimately, pharma teams face a choice: Invest in making generative AI deterministic through careful engineering, or bypass the challenge by using non-generative approaches for critical touchpoints. Both paths are valid; the key is choosing intentionally based on your use case rather than defaulting to one approach.
Has your organization wrestled with the notion of AI wobble? Drop us a note at hello@kinara.co, join the conversation on X (@KinaraBio) and subscribe on the website to receive Kinara content.