Transforming Enterprise AI with Scalable LLM Deployments

"When you use ChatGPT day to day, are you really using one model, or is there an agent behind it?"
He does not ask for an answer in the room. He wants the question to stick because the rest of his talk is about what happens when enterprises stop treating generative AI like a demo and start treating it like production software with real users, real data, and real consequences.
Mahmoud is a Lead AI Engineer at Mastercard, working in the unglamorous, high-impact zone where models meet reality: deployment, scaling, governance, and keeping systems safe. His message is not that large language models are magic. It is that shipping them responsibly is hard, and the hardest parts are rarely the model.
Why enterprises struggle with classic ML
Mahmoud starts with a quick rewind. For two decades, machine learning and deep learning have delivered wins across classification tasks: image diagnostics, forecasting, fraud detection, and more. The problem is that enterprise environments are different.
Enterprises run on unstructured data. Mahmoud quotes a common estimate that 80 to 90 percent of enterprise data is unstructured. That is the heart of the dilemma. Traditional ML often wants labeled datasets and clean pipelines. Labeling at enterprise scale is slow, expensive, and politically messy.
Generative AI changes the shape of the problem. If you can take unstructured data, ground a model in it, and generate useful answers, you suddenly have a path to value that does not depend on labeling everything first. This is the promise that has pulled so many companies into the LLM wave.
But Mahmoud draws a sharp boundary between promise and production.
Stop obsessing over tomorrow, worry about today’s risks
He calls out a pattern he sees everywhere: people jumping straight to AI doomsday narratives about the future. Mahmoud's view is more immediate. The biggest problem is not what AI might do in some distant future. The biggest problem is the risk it creates right now.
In enterprises, the threats are not abstract. They are compliance risks, data exposure, governance gaps, and operational failures. He references a Nature piece that argues for shifting attention away from speculative catastrophe and toward present-day harms and controls.
This framing matters because it changes what teams prioritize. If you think the main issue is the future, you debate philosophy. If you think the main issue is today, you build safeguards.
The misconception that LLMs began with OpenAI
Mahmoud also tackles a myth that distorts how leaders make decisions. Generative AI did not begin with OpenAI. Language models existed decades earlier. What OpenAI changed, in his telling, was the interface and usability. The leap was not only the model. It was making the system conversational and accessible enough that anyone could “talk to it” naturally.
In enterprise settings, that usability shift is double-edged. The easier it is to use, the easier it is to misuse. A chat interface can become a shortcut to sensitive data if guardrails are weak.
The real stack: model, environment, software layer, infrastructure
Mahmoud lays out the major layers you have to think about when deploying LLMs:
The foundation model, whether you build, fine-tune, or adopt open source.
The experimentation environment, where teams test and iterate.
The infrastructure, often GPUs and serving capacity.
And the software layer that holds everything together.
He emphasizes that enterprises can often acquire models and hardware. Open source platforms and model hubs make it easy to start. Cloud and procurement budgets make infrastructure attainable. The hard part is the layer in between: integration, orchestration, user-facing reliability, monitoring, and the plumbing that connects models to the right data and the right controls. Without that software layer, you do not have a scalable product. You have a brittle experiment.
He reinforces this with a familiar idea from the ML world: only a small percentage of the work is the model itself. The rest is data governance, pipelines, observability, security, and operations. In his own Mastercard experience, he says the LLM is only a tiny fraction of the workload.
The “closed book” problem: why raw LLMs fail in enterprise
Mahmoud then zooms in on what happens if you try to use an LLM without grounding it in an enterprise context. He describes a “closed book” approach where the model answers from its training knowledge, plus the prompt. That approach breaks down quickly in real organizations.
He lists the failure modes his team has to manage:
Hallucination. Confident answers that are not true.
Attribution. The need to explain how a response was produced, especially in risk-oriented domains.
Model staleness. The fact that models age fast as better ones appear and contexts change.
Revision and deletion. When data rights change, you must remove data and potentially retrain or rebuild.
Customization. The explosion of domain-specific needs across business units.
That last one is a common enterprise trap. If every domain needs its own model, you quickly end up with dozens of parallel deployments that are impossible to maintain. So the obvious response is the one the industry has rallied around.
RAG: adding memory outside the model
Mahmoud describes the shift toward coupling a model with an external memory, most commonly Retrieval Augmented Generation. Instead of asking the model to “know” the domain, you retrieve relevant enterprise knowledge and feed it as context. This can reduce hallucination and improve relevance.
But he is careful not to oversell it. RAG introduces its own complexity, and he hints at the kinds of design questions that become engineering headaches:
How should you chunk content?
How long should chunks be?
How do you structure retrieval?
How do you keep context fresh?
How do you prevent leakage across users and teams?
RAG is not a silver bullet. It is a trade. You swap “model knows everything” for “system retrieves the right things,” which turns model deployment into distributed systems engineering plus governance.
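The design questions above can be made concrete with a toy sketch. This is not Mastercard's system or any real RAG library; the function names are hypothetical, and the word-overlap scorer is a stand-in for the embedding-based vector search a production system would use. It only illustrates where chunk size, retrieval, and grounding show up as explicit engineering decisions.

```python
# Minimal RAG sketch (hypothetical names; word overlap stands in for vector search).

def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a document into fixed-size word chunks (the chunking decision)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by naive word overlap with the query (the retrieval decision)."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the model: answer only from retrieved enterprise context."""
    joined = "\n---\n".join(context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"

corpus = chunk("Example policy text describing how cardholder data must be stored and encrypted at rest")
prompt = build_prompt("How must cardholder data be stored?",
                      retrieve("cardholder data storage", corpus))
```

Even at this scale, the trade Mahmoud describes is visible: the quality of the answer now depends on chunking and retrieval choices, not just the model.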
This is where his talk becomes very enterprise-specific.
The rise of agents and the danger of building them for everything
Then Mahmoud pivots to what he sees as the next evolution: agents.
He distinguishes three stages:
A single LLM, the basic prompt-response pattern that many teams used in 2022.
Workflows, where you send a task across multiple models or steps and aggregate results.
Agents, where the system reasons, acts, uses tools, and loops through feedback.
He warns against treating agents like a trend. Do not build agents for everything. Keep it simple. The more autonomous the system, the more risk and operational overhead you inherit.
To help teams decide, he offers a simple mental checklist. You should build an agent when the task is complex and multi-step. If it is not complex, use workflows. If the task is complex but not fully doable, reduce the scope. If the task is complex and doable but the cost of error is high, put a human in the loop.
In a company like Mastercard, that last condition is not optional. It is the default. Humans remain part of the system when errors carry legal, financial, or trust consequences.
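Mahmoud's checklist is effectively a small decision procedure, and it can be written down as one. The inputs and return labels below are hypothetical names of my choosing; the branching mirrors his four conditions as described in the talk.

```python
# Mahmoud's "should we build an agent?" checklist as a decision function.
# (Hypothetical function and label names; the logic follows the talk.)

def choose_pattern(complex_task: bool, fully_doable: bool, high_cost_of_error: bool) -> str:
    if not complex_task:
        return "workflow"                    # not complex: a workflow (or single LLM) is enough
    if not fully_doable:
        return "reduce scope"                # complex but not fully doable: shrink the task first
    if high_cost_of_error:
        return "agent + human in the loop"   # errors carry legal, financial, or trust consequences
    return "agent"                           # complex, doable, and the error cost is tolerable

choice = choose_pattern(complex_task=True, fully_doable=True, high_cost_of_error=True)
```

In a regulated enterprise, the third branch is the common case, which is exactly his point about humans staying in the system.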
Think like your agent
Mahmoud’s most memorable teaching moment is his “think like your agent” example. He describes a simple task: searching for “Mastercard Global.”
A human opens a browser, waits, types, clicks, and reads. An agent does the same thing through tools: click, type, navigate, retrieve, and respond. The point is not that this is impressive. The point is that it demystifies agents. They are models using tools in a loop, not mystical intelligence.
That framing is useful for product teams, too. If you can describe what a human would do step by step, you can design the agent’s tool access, constraints, and evaluation criteria. If you cannot describe it clearly, you probably are not ready for agent autonomy.
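The "models using tools in a loop" framing can be sketched in a few lines. Everything here is hypothetical: the tool names, the scripted plan, and the canned observations. In a real agent, an LLM would choose the next action from each observation instead of following a fixed script, but the act-observe loop is the same shape.

```python
# A toy "tools in a loop" sketch of the search example (all names hypothetical).

def open_browser(_):   return "browser ready"
def type_query(query): return f"typed: {query}"
def read_results(_):   return "top result: mastercard.com"

TOOLS = {"open_browser": open_browser, "type_query": type_query, "read_results": read_results}

def run_agent(task: str, max_steps: int = 5) -> list[str]:
    """Loop: pick a tool, act, record the observation. A real agent would
    let a model pick the next tool from the observation; here the plan is fixed."""
    plan = [("open_browser", None), ("type_query", task), ("read_results", None)]
    trace = []
    for name, arg in plan[:max_steps]:
        observation = TOOLS[name](arg)            # act through a tool
        trace.append(f"{name} -> {observation}")  # feed the observation back into the loop
    return trace

steps = run_agent("Mastercard Global")
```

Writing the human steps down first, as Mahmoud suggests, is what makes a trace like this designable and testable.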
Multi-agent collaboration and the protocols that enable it
Mahmoud looks beyond single agents toward multi-agent systems, where specialized agents collaborate like a team of experts, coordinated to solve complex tasks. He describes roles like a mathematician, statistician, and AI specialist working together, with a coordinator in the middle.
To make that possible, systems need standardized ways to connect agents to tools and agents to each other. He references two emerging protocol ideas:
Model Context Protocol, a unified way for models to connect to external tools and data sources.
Agent-to-agent protocols, a standardized way for agents to communicate with each other.
His message is that these approaches complement each other. One helps connect an agent to the world of tools. The other helps connect agents to each other.
It is also the moment where the host draws an interesting parallel: we spent years preaching cross-functional collaboration in human teams, and now we are building similar dynamics into our AI systems.
Mahmoud does not over-philosophize it, but the echo is there. Enterprise AI is not only about smarter models. It is about orchestrating a system of capabilities, just like product building has always been.
Mahmoud ends on a question a reviewer once posed to him: after all the risks and challenges, will LLMs and agents help us, or will we help them?
Want to watch the full talk?
You can find it here on UXDX: https://uxdx.com/session/transforming-enterprise-ai-with-scalable-llm-deployments
Want to make your career AI-proof? Make sure to read our new ebook: https://uxdx.com/ebook/career-compression/
Rory Madden
Founder, UXDX
I hate "It depends"! Organisations are complex, but I believe that if you resort to "it depends" it means you haven't explained it properly or you don't understand it. Having run UXDX for over 6 years, I am using the knowledge from hundreds of case studies to create the UXDX model - an opinionated, principle-driven model that will help organisations change their ways of working without "It depends".