Measuring AI success: the KPIs that capture what actually changed

There is a moment in most AI automation programmes that I have come to recognise as a warning sign. It usually happens about six months after go-live. The initial metrics look fine: hours saved, error rate down, costs reduced. And then someone in the boardroom asks: 'So how is the AI programme actually performing?' And the team in the room realise they are not entirely sure how to answer.

Not because the programme isn't working. It almost certainly is. But because the metrics they chose to track at the outset were designed to prove the case for the investment, not to manage the programme over time. They measured what was easy to measure, not what actually mattered.

This is one of the most consistent patterns I see across enterprise AI programmes, and it is one of the most avoidable. The measurement framework is not an afterthought. It is part of the programme design. Getting it wrong doesn't just make reporting harder; it undermines executive confidence in ways that are difficult to recover from.

Why FTE savings are necessary but not sufficient

I want to be careful here, because I am not arguing that FTE savings and cost reduction don't matter. They do. They are often the primary metric in the original business case, and they should be tracked.

The problem is that they capture only one dimension of value. A programme that frees 38 FTEs from referral administration and reduces prior authorisation approval time from 48 hours to four hours has delivered value in two dimensions. But only the cost saving is visible in the accounts. The clinical capacity gain, and the downstream effect of faster authorisations on patient care, bed availability, and clinical workflow, are much harder to see, and much larger.

The organisations that measure and report only the cost saving are systematically understating the value of their programmes. That understatement has consequences: it makes the next programme harder to fund, the executive sponsorship harder to maintain, and the case for expansion harder to make.

Decision velocity: the metric most often missing

Decision velocity is the measure of how long a decision takes from the moment a trigger occurs to the moment an outcome is produced. It is rarely tracked in enterprise reporting, and it is one of the most powerful indicators of what AI automation is actually delivering.

Consider prior authorisation in healthcare. Before automation, a clinician submitting an authorisation request might wait 48 hours for a response. The agent does not wait: it processes the request, applies the clinical criteria, produces a determination, and escalates exceptions to a physician, typically within minutes. The 44-hour reduction in decision time is not a cost saving. It is a clinical capability gain.

The same principle applies in financial services. An income reconciliation that previously required manual review over two to three days now completes in hours. The speed of that decision affects downstream processes such as cash management, reporting, and compliance filings. Decision velocity is a metric that boards understand intuitively, because they can see how it connects to the outcomes the organisation actually cares about.
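To make the metric concrete, here is a minimal sketch of how decision velocity might be instrumented, assuming each case record carries a timestamp for the trigger and one for the outcome. The field names, timestamps, and case data are illustrative, not drawn from any client system.

```python
from datetime import datetime
from statistics import median

# Illustrative case records: each pairs the moment a trigger occurred
# (an authorisation request submitted) with the moment an outcome was
# produced. All values here are hypothetical.
cases = [
    {"trigger": datetime(2024, 5, 1, 9, 0),  "outcome": datetime(2024, 5, 1, 9, 4)},
    {"trigger": datetime(2024, 5, 1, 9, 30), "outcome": datetime(2024, 5, 1, 9, 41)},
    # An escalated exception that waited for a physician:
    {"trigger": datetime(2024, 5, 1, 10, 0), "outcome": datetime(2024, 5, 1, 14, 0)},
]

# Decision velocity per case: elapsed time from trigger to outcome, in hours.
velocities = [(c["outcome"] - c["trigger"]).total_seconds() / 3600 for c in cases]

print(f"median decision time: {median(velocities):.2f} hours")
print(f"slowest case:         {max(velocities):.2f} hours")
```

Reporting the median alongside the worst case is a deliberate choice: a handful of escalated exceptions should not mask the step change in routine cases, but the board should still see the tail.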

Capacity created versus headcount reduced

The standard way of reporting AI automation value is in FTE equivalents: 'this programme delivered the equivalent of X FTEs.' This framing has become so standard that most organisations no longer question it, but it can be actively misleading.

Capacity created and headcount reduced are not the same thing. In most enterprise programmes, the capacity released by automation is not released from the payroll; rather, it is released from low-value, repetitive work and redirected toward higher-value activity. The clinical staff freed from referral administration can focus on patient care. The compliance analysts freed from manual reconciliation can focus on exception investigation and regulatory relationship management. The operations team freed from plant reporting can focus on the decisions that require their expertise.

The value of that redirection is real. It is measurable through productivity metrics, quality measures, or simply through what the organisation is now able to do that it couldn't before. Reporting it requires a bit more thought than counting FTEs, but the result is a much more compelling and accurate picture of what the programme has delivered.
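As a hypothetical worked example of the distinction, the sketch below separates the hours actually removed from the payroll from the hours redeployed into higher-value work, and values the two differently. Every figure is an assumption chosen for illustration.

```python
# Separating the two quantities the FTE framing conflates.
# All figures below are illustrative assumptions.
HOURS_PER_FTE_YEAR = 1_600               # assumed productive hours per FTE per year

hours_released = 38 * HOURS_PER_FTE_YEAR  # capacity freed by automation
hours_removed_from_payroll = 0            # headcount actually reduced
hours_redeployed = hours_released - hours_removed_from_payroll

fully_loaded_hourly_cost = 55    # assumed blended cost of the released staff
redeployed_hourly_value = 95     # assumed value of the higher-value work they move to

cost_saving = hours_removed_from_payroll * fully_loaded_hourly_cost
redirection_value = hours_redeployed * redeployed_hourly_value

print(f"cost saving (visible in the accounts): ${cost_saving:,.0f}")
print(f"value of redirected capacity:          ${redirection_value:,.0f}")
```

The point of the separation is that the first line can legitimately be zero while the second is large: a programme that removes nobody from the payroll has still created capacity, and reporting only the first number understates it.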

Error rate and exception volume as value indicators

For processes that are complex, rule-bound, and high-stakes — prior authorisation, income reconciliation, reauthorisation workflows, compliance checks — the reduction in error rate is often the most significant outcome of automation, financially and reputationally.

AI agents apply consistent logic every single time. They do not have bad days, they do not make the errors that come from fatigue or from working on something for the twelfth time in a row, and they do not interpret ambiguous instructions differently depending on who is handling the case. For a healthcare provider processing thousands of authorisations per week, even a small percentage reduction in error rate translates into significant avoided costs: rework, rerouting, compliance incidents, patient experience failures.
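A back-of-the-envelope version of that arithmetic looks like the sketch below; the volumes, rates, and unit costs are assumptions, not client data.

```python
# Hypothetical avoided-cost arithmetic for a provider processing
# thousands of authorisations per week. Every figure is an assumption.
weekly_volume = 5_000
baseline_error_rate = 0.030     # assumed 3.0% of cases mishandled before automation
automated_error_rate = 0.002    # assumed 0.2% after automation
avg_cost_per_error = 180        # assumed blended cost of rework, rerouting, incidents

errors_avoided_per_week = weekly_volume * (baseline_error_rate - automated_error_rate)
annual_avoided_cost = errors_avoided_per_week * avg_cost_per_error * 52

print(f"errors avoided per week: {errors_avoided_per_week:.0f}")
print(f"annual avoided cost:     ${annual_avoided_cost:,.0f}")
```

Even with deliberately conservative inputs, the multiplication of a small rate change across a large weekly volume is what makes error rate a first-order metric rather than a footnote.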

For a financial services firm, the value of near-zero error rates in reconciliation processes includes avoided regulatory penalties, improved audit outcomes, and reduced operational risk — categories of value that a cost-saving calculation will never capture.

Operational resilience: the undervalued metric

Resilience has historically been a difficult concept to put a number on. It is easy to say that a process is 'more resilient' when it is supported by automation. It is harder to quantify that resilience in terms that a CFO will find credible.

The framework I find most useful starts with the question: what would this failure cost? If the referral management process went down for 24 hours due to staff absence, system failure, or unexpected volume — what would the impact be? For a provider processing 250,000 referrals per year, the answer involves patient care delays, downstream care coordination failures, and significant operational cost. The value of preventing that failure is the value of the automation's resilience contribution.
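One way to put a number on that, offered as a sketch rather than a method, is an expected-cost comparison: an assumed incident frequency before and after automation, multiplied by an assumed cost per incident.

```python
# A minimal expected-cost framing of resilience.
# The frequencies and impact figure are assumptions, not client data.
failures_per_year_before = 4      # assumed outage-equivalent incidents pre-automation
failures_per_year_after = 0.5     # assumed residual incident rate with automation
cost_per_failure = 750_000        # assumed 24-hour impact: delays, coordination, cost

expected_cost_before = failures_per_year_before * cost_per_failure
expected_cost_after = failures_per_year_after * cost_per_failure
resilience_value = expected_cost_before - expected_cost_after

print(f"expected annual failure cost, before: ${expected_cost_before:,.0f}")
print(f"expected annual failure cost, after:  ${expected_cost_after:,.0f}")
print(f"annual resilience contribution:       ${resilience_value:,.0f}")
```

The inputs are debatable, and a CFO will debate them, but that is the point: the debate is then about frequencies and impacts, which the organisation can estimate, rather than about whether resilience counts at all.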

The consumer goods programmes we have run provide a clear example. One client avoided over $200 million in production downtime through intelligent automation of their operational processes. That number is not a cost saving in the traditional sense; it is the value of continuity. It is real, it is measurable, and it does not appear in any FTE calculation.

Building the measurement framework before go-live

The reason measurement frameworks are so often inadequate is that they are designed after the programme is already running, when the question becomes 'what data do we have?' rather than 'what data do we need?' By that point, the instrumentation is already in place and changing it requires significant additional work.

The right approach is to design the measurement framework as part of the programme design — defining the key metrics, establishing the baseline, designing the data capture, and agreeing on the reporting cadence and audience before the first agent goes live. This is not complex. It is a planning discipline. But it requires someone to ask the question early enough for it to be answered properly.
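In practice, that planning discipline amounts to little more than a structured record per metric, agreed before go-live. The sketch below is one illustrative shape for such a record; the fields and example values are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

# One way a metric definition might be written down before go-live.
# Structure and example values are illustrative only.
@dataclass
class MetricDefinition:
    name: str          # what is measured
    baseline: str      # the pre-automation value, captured before go-live
    data_source: str   # where the measurement comes from
    cadence: str       # how often it is reported
    audience: str      # who reviews and challenges it

framework = [
    MetricDefinition("decision velocity (prior auth)", "48 hours median",
                     "case timestamps in workflow system", "monthly", "ops leadership"),
    MetricDefinition("capacity redirected", "0 hours",
                     "time-allocation survey + workflow logs", "quarterly", "board"),
    MetricDefinition("error rate", "3.0% of cases",
                     "QA sampling + exception log", "monthly", "compliance committee"),
]

for m in framework:
    print(f"{m.name}: baseline {m.baseline}, reported {m.cadence} to {m.audience}")
```

The discipline is in the baseline column: if the pre-automation value is not captured before the first agent goes live, every later claim about improvement rests on reconstruction rather than measurement.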

The programmes that sustain executive sponsorship over multiple years are almost always the ones that have built this discipline in from the start. They can show, at any point, exactly what the programme has delivered in terms that the board can understand, challenge, and build on. That visibility is not a reporting exercise. It is what keeps the programme funded.

Next

Building the hybrid workforce: what changes when AI agents join the team