
Sid Bhatia, area VP and GM META at Dataiku
The UAE has become the region’s preeminent proving ground for artificial intelligence (AI). An Emirates NBD report predicts AI will contribute more than $96bn to the UAE’s GDP by 2031.
A large part of that growth is expected to come from AI agents. Agents differ from other forms of AI in that they act independently and make their own plans. Their agility, powered by probabilistic reasoning, allows them to adapt while operating. These differences endow agents with powerful capabilities, but they also make agentic AI more unpredictable than its predecessors, with the same inputs yielding different outputs between runs. Yet agents have real-world dirham investments behind them, so how do adopters calculate ROI for, or indeed meaningfully assess, a technology that is so non-deterministic?
An AI agent can not only fail at its assigned task but also take unnecessary actions, leak sensitive data, and commit a range of other undesirable slip-ups. Meaningful evaluation of agents must go beyond traditional pass/fail results and incorporate all possible points of error. The goal is to rate performance on a quality scale that encompasses reasoning, adaptation, and value delivery. To avoid poor user experiences, non-compliance, latency, and budget waste – all risks run by a business that does not evaluate individual agents – I recommend five metric categories for the assessment of AI agents.
01 Task outcomes
The success and output quality of core tasks are as important metrics for agentic AI as they were for its predecessors. Evaluators must look for accuracy and reliability. Measure task completion rates of major workflows. Have domain specialists catalogue the accuracy, precision, and compliance of outputs. Take note of the frequency of errors, retries, and escalations. Adopt industry-standard benchmarks and incorporate human reviews to uncover problems with business-critical tasks. All measurement must be continuous to give opportunities for improvement over time.
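As a minimal sketch of how these task-outcome rates might be aggregated – the `TaskRun` record and its fields are hypothetical, not tied to any particular platform – a few lines of Python suffice:

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    """One agent run on a workflow (illustrative fields)."""
    completed: bool   # did the agent finish its assigned task?
    errors: int       # errors encountered during the run
    retries: int      # automatic retries
    escalated: bool   # handed off to a human?

def task_outcome_metrics(runs: list[TaskRun]) -> dict:
    """Aggregate completion, error, retry, and escalation rates."""
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "avg_errors_per_run": sum(r.errors for r in runs) / n,
        "avg_retries_per_run": sum(r.retries for r in runs) / n,
        "escalation_rate": sum(r.escalated for r in runs) / n,
    }
```

Recomputing these figures over every deployment cycle, rather than once, gives the continuous measurement the approach calls for.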
02 Business value
User satisfaction and other tangible business gains must be goals for AI, as they are for any other business asset. Agents may be accurate; they may be precise to 20 decimal places. But if they do not add value for end users, formal evaluations must reflect that. Calculate time saved per workflow and compare it with an established baseline. Look at adoption and repeat-usage rates, and use Net Promoter Score (NPS) or satisfaction surveys tailored to interactions with AI agents. Use A/B testing to compare agentic and traditional workflows. Follow the user journey and encourage feedback to discover pain points.
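Two of these business-value measures can be computed directly. The sketch below is illustrative only, with hypothetical parameters: a standard NPS calculation (share of promoters scoring 9–10 minus share of detractors scoring 0–6) and a simple time-saved estimate against a pre-agent baseline.

```python
def nps(scores: list[int]) -> float:
    """Net Promoter Score from 0-10 survey scores:
    % promoters (9-10) minus % detractors (0-6)."""
    n = len(scores)
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100 * (promoters - detractors) / n

def hours_saved(baseline_minutes: float, agent_minutes: float,
                runs_per_month: int) -> float:
    """Monthly hours saved versus the established baseline."""
    return (baseline_minutes - agent_minutes) * runs_per_month / 60
```

For example, a workflow that drops from 30 to 10 minutes and runs 300 times a month saves 100 hours – a figure that can be priced and set against the agent's running costs.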
03 Effectiveness
Assess the quality of agents' reasoning. Trace how they autonomously build workflows in real time. Do they use the right tools? Do they perform tasks in the optimal number of steps, and in the optimal order? By monitoring efficiency at this level of granularity, organisations also make systems more transparent. Additionally, assessors should note how often chains of reasoning are abandoned, which means they must be able to trace these chains and visualise “agent trails”.
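A minimal hypothetical structure for such traces might look like the following – the `AgentTrail` record and its fields are assumptions for illustration, not a real tracing API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentTrail:
    """One agent run's trail: the ordered steps it took (illustrative)."""
    steps: list[str] = field(default_factory=list)
    abandoned: bool = False  # did the agent drop this chain of reasoning?

def abandonment_rate(trails: list[AgentTrail]) -> float:
    """Fraction of runs where the reasoning chain was abandoned."""
    return sum(t.abandoned for t in trails) / len(trails)

def avg_steps(trails: list[AgentTrail]) -> float:
    """Average workflow length, to compare against the optimal step count."""
    return sum(len(t.steps) for t in trails) / len(trails)
```

Even a simple log like this makes the two questions above answerable: compare `avg_steps` with the known optimal path, and watch `abandonment_rate` across deployments.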
04 Governance
Trust is critical in an AI journey. Oversight ensures compliance; transparency, auditability, and safeguards enable oversight. Evaluation of an AI agent must include confirmation of its safe and ethical operation, as well as verification of its compliance and auditability. Instances of policy violation, bias, or undesirable outputs should be recorded. Organisations must also find ways of assessing the auditability of decision-making and tool usage, as well as the effectiveness of safeguards. Red-teaming, guardrails and moderation systems can help, but it is also important to run regular automated safety tests and keep comprehensive audit logs.
05 Live performance
Operational performance at scale must be evaluated to form a full picture of the agent. Agentic AI is powerful on paper, but does each agent meet enterprise-grade workload requirements? Measure latency, uptime rates, error frequencies, costs per interaction, and model drift. Always-on dashboards must be available and should flag anomalies as they emerge. Stress testing during peak usage is also crucial. Track costs at the user, team, and workflow levels to ensure they do not spiral out of control.
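As an illustrative sketch – hypothetical inputs and a made-up latency threshold, not a vendor dashboard – these operational numbers can be summarised from raw per-interaction logs:

```python
import statistics

def live_performance(latencies_ms: list[float], costs_usd: list[float],
                     latency_slo_ms: float = 2000.0) -> dict:
    """Summarise latency and cost; count runs breaching a latency SLO."""
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]  # 95th percentile
    return {
        "p50_latency_ms": statistics.median(latencies_ms),
        "p95_latency_ms": p95,
        "avg_cost_per_interaction": sum(costs_usd) / len(costs_usd),
        "slo_breaches": sum(l > latency_slo_ms for l in latencies_ms),
    }
```

Feeding a rolling window of interactions through a summary like this is one way an always-on dashboard could surface the anomalies – latency spikes, cost creep – the moment they emerge.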
Not-so-secret agents
These metric categories form a framework for building trust – and hence acceptance – of agentic AI over time. Throughout the AI journey, agents will improve autonomously under the right conditions: careful monitoring across deployment cycles, coupled with expert-guided refinement. This approach matters because it keeps metrics aligned with business needs rather than arbitrarily defined AI benchmarks. IT, line-of-business, and compliance teams must collaborate on evaluation. Without subject-matter experts, the evaluation of agentic AI will be insufficient to identify all business risks and relevant success criteria.
Agents must be safe and useful. As the field grows across the UAE, more enterprises will experiment with AI agents. Success will accrue to those that introduce the right evaluation practices early. Those enterprises will enjoy confidence in more than just the accuracy and precision of their agents: they will know their agents are reliable, safe, scalable, and transparent.