Strategic AI Integration in Engineering Teams

Continuous Development
UXDX USA 2024

As senior management in the Enterprise AI section of Google, Keyvan offers a fresh perspective on integrating Artificial Intelligence into your teams. He transforms the conventional view of AI from a challenging issue into a dynamic tool. Keyvan will walk through practical strategies for incorporating AI into your projects, managing expectations, and improving user interactions with a human-centric focus. Keyvan urges us to recognize AI's potential as a multifaceted tool for diverse problem-solving, moving beyond restrictive applications. Key Points discussed:

  • Practical AI Use Cases: Learn to identify where AI can genuinely enhance your product lifecycle. Keyvan will guide you through selecting suitable AI applications that align with your team's goals and project demands
  • Managing Expectations with AI: Understand how to set realistic business expectations around AI initiatives. Keyvan will discuss integrating probability-based outcomes into AI workflows, enabling more informed decision-making and expectation management
  • Google’s Approach to AI: Gain insights into Google's internal strategy for AI, emphasizing user value and process optimization. Discover how a leading tech giant approaches AI to ensure its contributions are meaningful and impactful

By attending Keyvan's talk, you'll leave with a clearer understanding of how to effectively integrate AI into your product teams, fostering innovation while staying grounded in user-centric principles.

Keyvan Azami

Keyvan Azami, Enterprise AI Engineering Lead, Google

I'm Keyvan Azami from Google. I'm an engineering manager and I focus on solving business problems for Google's own day-to-day operations using ML. I lead a team that's been doing this for quite some time - eight years or so - before GenAI was in everybody's minds. What I'm hoping to do today is just walk you through some of our experiences and hopefully let you take away some insights about how to apply ML in day-to-day life, how to deal with the uncertainties that it brings about.
I might not bedazzle you with the greatest in ML - you can get that from Google I/O Cloud and all those places - but I find that applying ML to real-world solutions takes a different approach, and sometimes that muscle hasn't been developed in all of our teams so far.
Nobody gave me a clicker to move the slides forward... Ah there it is, thank you, appreciate it.
Our mission: Applied AI Research as Enterprise Competitive Advantage. What we do is work with Google Research, specifically Google DeepMind, and look at what relevant capabilities they're working on that could enhance our business functionality. We work with different pillars at Google - Finance, Legal, HR, Technical Support, etc. - and find those solutions that could make sense.
I'm going to walk through a case study that we worked on a few years ago and talk you through some of our processes, some of the challenges that we came across and how we overcame them. Hopefully this would be interesting and useful.
Just to set the scene, Google as you might guess has billions of users across their products such as Search, Maps, YouTube, Cloud, etc. As of the last financial statement, we had 180,000 employees plus an extended workforce. These folks generate millions of inquiries every year for IT support - systems, hardware, software, etc. What we want to do is get people answers faster and make both our support staff as well as our Googlers more efficient.
The self-service automation goal is to identify critical self-help areas and make this as intuitive and accessible as possible. With this project, my team was tasked with surfacing relevant support articles to internal users when they raise a ticket. When they raise a ticket, don't wait - if we can find a solution for it automatically, make a recommendation, link them to the help article, and let them get the problem solved sooner rather than later.

Our general ML lifecycle is a little bit different from traditional software development. We start with ideation and prioritization. Data acquisition and data exploration is a big important phase for us, and I'll walk you through the details of that. Then we have prototyping, and you might see some interesting turns of events here where we don't actually follow a linear path to get through this because the nature of ML is iterative and there's a lot of experimentation that goes on.
Ideation and prioritization starts with a business problem. I know it's very tempting to think that the answer is Generative AI no matter what the question is, but start with the problem. What is a business problem that you're trying to solve? Assess that on its own merit and look at the complexity of the potential solution and the possible impact. We lay out the different ideas, we lay out the impact and complexity, and we start focusing from the top.
For article suggestions - finding support articles relevant to the problem at hand - what's the business impact? Tens of thousands of hours in efficiency gains for Googlers and support staff. Is ML a good solution to the problem? Do we have enough data? What kind of data is it? What type of problem are we trying to solve? What's the risk tolerance for the users, stakeholders, and the support organization? What happens if we get the answer wrong? That's an important question because ML is probabilistic. If somebody says "I need 100% guarantee that you give me the right answer," I can't do that project. We have to be able to deal with that uncertainty in the process. Can we measure success? Are there metrics available so we can know whether we've actually achieved the goal? And the answer was yes.
Data acquisition - what do we do? We looked at our support data. We had millions of resolved tickets, about 10% of them were linked to an article. Somebody raised a question such as "I can't remote into my machine," and they waited a couple hours until one of the support staff came on board, saw the ticket, and responded with a link to the article that solved the problem. We want to automate that 10% - that's tens of thousands of tickets that could be solved.
What data would we want? We want validated responses. We want to look at the tickets that were solved and had an article linked to it, so finding those is really critical.
Data exploration - we spend quite a bit of time here. We don't jump into the solutioning. What are we actually looking at? We claim that we have these tickets and we claim that we have these articles - what do they look like? What is the relationship between them? My team at Google uses Google Colab quite a bit. There are a lot of other solutions out there - Jupyter notebooks, etc. - where you can pull your data and have different illustrations to help you understand the relationship between the features that you're trying to map to each other.
This is where I think it's more than an engineering task. This is where UX researchers come into play, where our product teams can come into play to help identify these relevant important capabilities that we can draw from the data. The tickets have titles, descriptions, links, and the articles will have a bunch of text, titles and descriptions. So it's a big language problem that we have to solve for, and obviously we know that there is this issue that we can only solve 10% of the data, 10% of the tickets.
Model prototyping - we have some data, we've looked at this, and let's start with a hypothesis and see if we can solve it. A typical approach is something called a dual encoder. You take the two things that you're trying to map to each other - the tickets and the articles - and you develop a model that codifies these two documents such that if they're related to each other, the code between them will be very similar. Details are not that important in terms of how that works - you can always look up dual encoders and there's tons of literature out there.
We start with this to see if the solution works. We start with even a very simple algorithm - TF-IDF (again, look it up if you're interested), which basically looks for the frequency of meaningful words in the documents that you're looking at. Then we can move on to more complex stuff - embeddings, contextual embeddings, etc.
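As an illustration of that simple starting point, here is a minimal sketch of TF-IDF matching in plain Python. It is not the team's implementation: the article ids, the article texts, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_idf(docs):
    # Inverse document frequency with add-one smoothing, so rare words weigh more.
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    # Sparse vector: term -> term frequency * IDF. Unknown terms are dropped.
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical help articles, invented for this sketch.
ARTICLES = {
    "remote-desktop": "fix remote desktop connection to your machine",
    "password-reset": "reset your account password",
    "vpn-setup": "set up vpn access on a laptop",
}


def best_article(ticket, articles):
    # Score every article against the ticket and return the closest one.
    idf = build_idf(list(articles.values()))
    ticket_vec = tfidf_vector(ticket, idf)
    scores = {
        article_id: cosine(ticket_vec, tfidf_vector(text, idf))
        for article_id, text in articles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


article_id, score = best_article("i cannot remote into my machine", ARTICLES)
```

A dual encoder follows the same pattern of encoding both sides and comparing the codes, but it replaces the word-count vectors with learned embeddings.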
We run this model, we train this model and we look at the results. Precision: 40% - out of all the tickets that the model recommended an article for, only 40% of them were actually correct. The other 60% recommended the wrong article. And what's worse - recall: 15%. What does that mean? It means that out of all the tickets that could have had an article associated with it, we only recommended 15% of them with an article. So we missed out on 85%.
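To make the two metrics concrete, here is a small sketch of how they could be computed for this kind of task. The ticket and article ids are invented, and the function assumes the standard definitions, where only correct recommendations count toward recall.

```python
def precision_recall(predictions, truth):
    """Both arguments map ticket id -> article id, or None when there is no article.

    precision: of the tickets we recommended an article for, how many were right.
    recall: of the tickets that had a correct article, how many we got right.
    """
    recommended = [t for t, article in predictions.items() if article is not None]
    answerable = [t for t, article in truth.items() if article is not None]
    correct = [t for t in recommended if truth.get(t) == predictions[t]]
    precision = len(correct) / len(recommended) if recommended else 0.0
    recall = len(correct) / len(answerable) if answerable else 0.0
    return precision, recall


# Invented example: five tickets, four of which have a known correct article.
truth = {"t1": "A", "t2": "B", "t3": "C", "t4": None, "t5": "D"}
predictions = {"t1": "A", "t2": "X", "t3": None, "t4": "A", "t5": None}

precision, recall = precision_recall(predictions, truth)
# Three recommendations, one correct: precision is 1/3.
# Four answerable tickets, one answered correctly: recall is 1/4.
```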
The results weren't great. But this is where the iterative nature of the work comes out. You start asking: Does the model not perform well because of my features? Do I need more information about the problem that I'm trying to solve? Does the model not perform well but there is room for experimentation? Can we try other techniques? We started with dual encoder and some language techniques, but are there better solutions to this? And what's worse - did I even think about the problem right? Did I frame the problem correctly?
And then here's where you realize - I thought I was moving forward but no, we go back and forth. This was the lifecycle that I described and we actually work like this. We go back, we look at the data, we look at the idea again and say how can we improve this? How can we make this more appropriate? This is where the interaction between engineering, product, and UX becomes quite important because that problem framing stage is where you're trying to understand how you're solving the business problem.
Ideation prioritization again - focus on the impact we're trying to achieve. What are we trying to do? Do we really need to suggest from the long tail of articles? We've got a lot of articles - tens of thousands of them. Do we actually care in the first instance that every one of them is looked at for each ticket? Maybe not. What we need to do is focus on an 80/20 approach. If we get 80% of the tickets resolved with the most common articles, that's still a win.
Can we simplify the problem? This is where we reframe the problem differently. Instead of finding an article, we say actually we're going to classify it. We're going to say here are the most common problems and for a ticket, we're going to say which class, which type of ticket, which type of article does it relate to most.
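One way to picture this reframing: pick the smallest set of most-linked articles that covers a target share of historical tickets, and treat each of those articles as a class. The sketch below assumes a list of article ids taken from resolved tickets. The ids, the 80% coverage value, and the function names are illustrative, not taken from the project.

```python
from collections import Counter


def head_articles(linked_article_ids, coverage=0.8):
    # Walk articles from most to least frequently linked and stop once the
    # selected ones account for the requested share of tickets.
    counts = Counter(linked_article_ids)
    total = sum(counts.values())
    selected, covered = [], 0
    for article_id, count in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append(article_id)
        covered += count
    return selected


def to_label(article_id, classes):
    # Long-tail articles are mapped to None, meaning "no suggestion".
    return article_id if article_id in classes else None


# Invented history: one entry per resolved ticket that had an article linked.
history = ["vpn"] * 5 + ["password"] * 3 + ["printer"] + ["badge"]
classes = head_articles(history)
labels = [to_label(article_id, classes) for article_id in history]
```

With this toy history, two articles cover 80% of the tickets, so the classifier only has to choose between two classes instead of four.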
Different method, different technique, and focusing on returning the right article. If you remember we looked at that precision and recall - 40% precision, 15% recall. What we focus on is let's get the articles right even if we miss the opportunity. The ones that we do try, let's make sure the answer for those is right so at least we're helpful and we don't send people in the wrong direction.
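A common way to trade recall for precision is to suggest an article only when the model is confident and stay silent otherwise. This is a sketch of that idea, not the team's stated mechanism. The scores and the 0.8 threshold are made up, and in practice the threshold would be tuned on held-out tickets.

```python
def suggest(scores, threshold=0.8):
    """scores maps article id -> model confidence in [0, 1].

    Returns the top article only if its confidence clears the threshold,
    otherwise None (the ticket goes to a human as before).
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


confident = suggest({"vpn-setup": 0.91, "password-reset": 0.05})
unsure = suggest({"vpn-setup": 0.55, "password-reset": 0.40})
```

Raising the threshold means fewer tickets get a suggestion (lower recall), but the suggestions that do go out are more likely to be right (higher precision).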
We start back in the prototyping phase. We try to identify the articles - let's say we find a thousand articles and we start mapping them. We train a model to try to guess which article would be most relevant. Language modeling as we talked about - we start getting better results, not perfect but better. 50% precision, 25% recall. We're getting more accurate but still not good enough.
What's wrong this time? Does the model not perform well, but is there room for experimentation? We're still short of the goals. Time to debug. We start engaging with the stakeholders, the partners. Again, back to the iterative nature - bring them back in, let them see, let them weigh in on the problem and the solution.
We show them some examples. A ticket comes in, the model says X, and in the data actually there was no article that would have been appropriate. The feedback is that seems like a mistake. We do this again, they kind of understand it. We do this again and they start seeing that actually there's a labeling issue. All the data we were looking at initially, the tickets that were resolved with an article linked, wasn't always correct. There were some false positives in there: some tickets were linked to the wrong article, which we didn't anticipate initially.
So we go back to data acquisition. Here we work with the partners and say, okay, what if we actually run a labeling exercise? We take 2,000 tickets and have your expert staff label them. Make sure the articles are correct, make sure they're linked correctly, so that when we're training a model it's learning from correct information. It's the old adage of garbage in, garbage out. So we clean that up, and surprisingly, even with 2,000 tickets you can make a dramatic improvement.
We start prototyping again with this labeled data, we try a bunch of different ideas, and we start seeing significant improvement. We get to 80% precision - that's really good in the ML world. We can work with that. Recall is improving too, and even if it's not as high, that's okay: we solve 43% of the problems that could have been solved, and with 80% precision that's a solution the stakeholders can work with. But it's important to engage them at that level, because ultimately when the model gets it wrong, somebody's going to complain "Hey, you sent me the wrong way," and they need to be comfortable that the benefits are worth the occasional pain and escalation that may happen.
Then the hard part - model productionization. You spend all this time looking at data and building the model, but to actually make a solution, put it in production, and make it an industrialized solution, you need a lot of different pieces, and ML code is a small part of it. We have all this apparatus that you have to put together. Larger organizations will have their own third-party solutions, cloud solutions across all the typical vendors, but this is where you have to put it together and make it work as real software would.
Monitoring and maintenance - now here's the problem. We looked at tickets at a point in time, what if things changed? COVID happened and the problems were different. People were complaining about different types of issues. They're working remotely, there were much more remote-oriented questions and issues. So you start seeing actually a drop in the quality of your model. It doesn't often find the articles that it needs. It starts sending people down the wrong way more often. That precision drops.
You need to be able to measure this. You need to have tools, dashboards, reports, whatever it is - some way of tracking metrics. You don't need to do this on a day-to-day basis, but at least on a quarterly basis, if not more frequently, so you can tell when the solution's quality changes. In traditional software development, you have regressions when you change the software. Here you can have a regression even if you change nothing, because the data distribution in production changes. So you've got to keep an eye on that. That's what we call model drift, and you'll have to retrain. You could have continuous training cycles that will help you.
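A minimal sketch of that kind of periodic check follows. It assumes you record, for each period, whether each suggested article turned out to be correct. The period names, the baseline of 0.80, and the allowed drop of 0.10 are illustrative values, not figures from the project.

```python
def precision(outcomes):
    # outcomes: list of booleans, True when a suggested article was correct.
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def drift_alerts(outcomes_by_period, baseline, max_drop=0.10):
    # Flag every period whose precision fell more than max_drop below the
    # precision measured when the model was launched.
    alerts = []
    for period, outcomes in outcomes_by_period.items():
        observed = precision(outcomes)
        if baseline - observed > max_drop:
            alerts.append((period, round(observed, 2)))
    return alerts


# Invented feedback data: ten reviewed suggestions per quarter.
feedback = {
    "2020-Q1": [True] * 8 + [False] * 2,
    "2020-Q2": [True] * 6 + [False] * 4,
}
alerts = drift_alerts(feedback, baseline=0.80)
```

An alert like the one for the second quarter here would be the signal to look at what changed in the incoming tickets and to consider retraining.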
Takeaways and other considerations: Have a structured approach to machine learning projects. It's very tempting to say "Hey, just give me some data, I'm going to start prototyping this," and this is where I think product teams are quite important, because you can ground the engineers a little bit more. Think about the problem, frame your problem correctly, think about the different approaches that you could take, think about the users. I failed to mention how much of our focus is on the human-in-the-loop process. How do we make sure that this automation is not people versus machines - it's people with machines. It's technology as an assist, not independent automation.
Be ready to iterate and ask questions. That's okay, and we can fail fast. Those diagrams that you saw - be okay with getting to a point and saying "Well, you know what, the precision and recall are just not good enough, we're going to abandon this project." The sooner you do that, the better, as opposed to sinking in many months of iterative work only to get marginal improvements. And celebrate those failures - in my book they're actually not failures, they are experiences that you've gained. We have developed a high tolerance for that type of thing because we want to encourage experimentation, but we want to fail fast so we don't waste too much time.
Leverage the right tools for the job. ML has different tool sets, so look for them, research them. There are a lot of good opportunities, and the space is improving quite fast as well. And look for the impactful problems that are fit for machine learning. The first question is: can I solve this problem with a rules-based approach, with heuristics? If you can, that's where you should start, not with machine learning. Machine learning should be the next step. Develop a baseline so you can build on top of it.
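For the article-suggestion problem, a rules-based baseline could be as small as a keyword table. The sketch below is only an illustration: the keywords and article ids are hypothetical, and a real rule set would come from what support staff already look for in tickets.

```python
# Hypothetical keyword rules: if any keyword appears in the ticket, suggest the article.
RULES = [
    ({"remote", "rdp"}, "remote-desktop"),
    ({"password", "locked"}, "password-reset"),
    ({"vpn"}, "vpn-setup"),
]


def rule_based_suggestion(ticket_text, rules=RULES):
    # Plain if-then-else logic: the first rule with a matching keyword wins.
    words = set(ticket_text.lower().split())
    for keywords, article_id in rules:
        if words & keywords:
            return article_id
    return None  # No rule fired, so make no suggestion.


first = rule_based_suggestion("I can't remote into my machine")
second = rule_based_suggestion("My monitor is flickering")
```

Measuring precision and recall for a baseline like this gives the number an ML model has to beat to justify the extra effort.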
So that was our case study. We have a few minutes - I'm happy to answer any questions if there are any, but also folks can always reach out offline too.

Q: Thank you for sharing your work on this article-surfacing ML project. Can you share how you worked with the UX designer? What was good and what can be improved? A: Great question. Think about it from a user perspective: whether it's a Googler or a support agent, what do they see? The question for us was, when they raise an issue, how do we reach back out to them, how do we provide the feedback? Do we send them an email? Do we help the agent and have the agent validate before we send that? That created some delays. Do we add the article directly into the ticket and hope for the best? Those were some of the considerations that we were working through with our UX partners. There were future phases of this project where we had more real-time interaction, for instance when they're actually engaging an agent in real time, where I think the user experience was a little bit more dynamic. So we were thinking about how you surface relevant articles as the chat goes on. Here we talked about a ticket mapping to an article, but if you have a live chat and the model is constantly monitoring the conversation and says "Ah, now I understand what they're talking about, it's probably this article," then how you surface that was definitely a key aspect. But really, how we want to expose the user to potential uncertainty was a consideration for the UX team.
Q: Can you elaborate on the rules-based approach and creating a baseline? We have an opportunity to apply machine learning, but my junior employees don't quite get what creating a baseline and rules means. A: Thank you for the question. This problem may not be the most appropriate example, but the idea is: how would you describe the solution in traditional ways, with if-then-else? If you see this type of issue, if you see the following words in the ticket, do the following thing: choose this, recommend that, or show them a list of potential articles. That would be a rule, a heuristic that you could apply, because that's what a human would naturally do when they see that issue. So try to replicate that in code, and the baseline here would be how effective that is. Obviously, without ML it would be hard to match free text between articles and tickets, but if you were looking for particular words and tried word matching and things like that, you could probably get somewhere. Then you could look at the quality of that, the precision and recall of just that rules-based approach, a very traditional software development approach. You'll probably find that precision may be something like 20% and recall maybe 50%, because you'll probably attempt more tickets, and that gives you the baseline to say: okay, if I apply ML, can I get better than that, and how much better for the effort?
Q: How do you make sure your structured approach is effective? Do you validate your approach before you start the project? A: Yes. There is a little bit of experience that you develop from doing one or two projects, but by structured I mean even just looking at some of the blueprints that I illustrated, the steps that we take, and thinking about your project in that way. When you're trying to plan a project, it's hard when somebody asks how long it's going to take to build a machine learning project. I don't know, I really don't know, because I don't know what's in the data and I don't know what will work. What I can do is say: okay, for my next activity I'm going to look at data, and I'm going to take three weeks doing that. I may have to extend it, but at least I can have an estimate for that task. So you can break down your project that way and then gradually develop confidence around some of the abilities. If you've done a project with language recognition a couple of times, the third time you may have a better approach to estimating it. But just breaking the project down into phases and time-boxing them will help you manage expectations and set some goals for the team that's working on it.
Q: At what stage should the UX designer be getting involved? A: Very good question. I think UX designers need to be involved from early on, from the early phases. It can be tempting to say, okay, people are talking about data science and models and algorithms, so I can back away. But on a recent project of mine, we had early involvement and then very quickly focused the conversation on engineering, and the product that was developed missed the opportunity to think about the UX aspect of it, so we're now backtracking. So early on, before you jump into what ML can do, think about that experience and how it would work, assuming that the model is not going to be perfect. The model is going to have X% precision, so how do you surface that uncertainty to the users? You can't hide it from them. You can't say to the user, "Here's the answer, I know this is the answer." You have to leave some room, and that's the job for product and design teams: to figure out how we want to manage that risk. So I think early on is the answer. How early? Day one. Day one, day two, something like that.
Q: Was the improvement to 80% accuracy just from labeling? A: The labeling improved things significantly, but if I go back, I can't see it here anymore, we also looked at some language models. This project is from a few years ago, so we used the BERT language model, and that helped us quite a bit. It had a much more sophisticated ability to understand the context of those articles compared to the initial, primitive TF-IDF approach. So there is a lot that you can gain from improving the techniques that you're using, and nowadays, with very sophisticated language models, you can probably do even better. But the labeling was important because that was your ground truth: it was the thing you could go back to, and it helped drive the precision.