ai artificialintelligence claude anthropic

TIKTOK

ai artificialintelligence claude anthropic

Jun 02, 2026

4919 words 60% confidence

Thumbs up. All good? All right. Everyone, I hope that you have had a fantastic day at Code with Claude London so far today. My name is Will. I'm on our engineering team at Anthropic. I sit on a team called Applied AI. What that means is I essentially split my time between internal engineering work and time spent building agents with customers. So folks, imagine that you built and shipped an agent to solve a problem. I'm sure that's something that a lot of the folks in this room have actually done. And imagine that this agent worked fantastic, right? But it worked so well that a few weeks after shipping, you were asked to add some additional capability to the agent. A few weeks after that, you received more business requirements and you added additional capability. This pattern continued and continued until before you know it, your system prompt had grown to become several hundred lines long. You have dozens of tools and sub-agents that exist for your agent. And because of the complexity, you've started to see regressions in the areas that your agent was previously accelerating in. So if this is you, you're not alone. We see this type of scenario happen pretty commonly with customers and actually with ourselves included in that. So within this workshop, we are going to simulate an agent that has essentially grown to a complexity where we start to see degradation in its performance. We're then going to walk through some of the decisions that we as engineers and architects make in order to improve the design of our agent to restore the performance that we expect with the additional capability. Specifically, we're going to make some decisions around tools and skills and sub-agents. As we modernize the stack of our agents, we want to make sure that we're using the right agentic primitives at the right time. So when do you use a tool, when do you use a skill, and when do you use a sub-agent? We're going to talk through all of that in this session. As I mentioned, folks, this session will be hands-on. So let's go ahead and get started. I first want to walk you through our problem statement in our agent. So for the purposes of this session, we're going to be focusing on an agent called Stock Pilot. This is an inventory management agent that was designed by and for a mid-sized retailer. The agent that you see on the screen can do several things. It can flag low levels of stock. It can forecast demand. It can pick suppliers. It can file POs. And ultimately, it can write weekly reports for the employees of this retailer. Now, none of these capabilities are particularly complex on their own, but again, the issue is that we've essentially bolted capabilities onto our agent over time. Without modernizing our architecture, this complexity has started to cause some problems. Let's take a look at the actual architecture today of the agent. Folks, today the agent is facilitated by a single orchestrator. So you see the Stock Pilot orchestrator sitting at the top of the screen. The agent has a system prompt, as I mentioned, that's grown to be about 400 lines long. It has 12 different tools. Three of those tools happen to be wrappers around subagents with completely isolated context windows. So if you have the repo pulled up, which we'll go into more detail in just a bit, there's an agent that's under a folder called before, which essentially walks through this agent exactly. So again, orchestrator, long system prompt, a lot of tools. We have a lot of subagents. The result of this is that our evals have started to dip. So let's imagine how we got here for a moment. Again, we built that agent up front to solve a really specific problem. We received business requirements to, say, add maybe some forecasting capability to our inventory management agent. So what we decided to do was essentially just spin up a forecaster as a subagent. Again, later on, we received more requirements to add report writing capability to our agent. So we decided to add another subagent for that report writing capability. Again, our evals started to dip over time because we added more and more complexity while just bolting this capability on. So let's take just a little bit of time and talk about evals specifically. For this agent, folks, we have 12 different eval tasks across five different types of graders. So my colleague gave a talk on eval shortly before this. Evals will have a component within this workshop, but it won't be the main focus. I'll give you a quick summary of the tactical evals that we're using for this agent. On the left side of the screen, you see some IDs. You see several evals that start with the letter R. This stands for regression. These are more realistic, single-turn tasks that we grade the model's capability on. So imagine I give the model a task. The model comprehends that task within the agent, calls some tools, and then provides a response back to me. We're essentially evaluating that response. We also have some more complex tasks that we're grading the model on. So you see those F IDs, the IDs that start with F on the left side of the screen? That stands for failure mode. In this case, we're evaluating the model over a more complicated, multi-turn task that we're grading. Now, again, I won't go into evals too specifically. We have a number of different types of graders that are both deterministic and non-deterministic. When I talk about deterministic evals, we're grading things like turn count, and like latency, and like the number of tokens that are used as our agent is completing a particular task, and we're tracking those deterministic metrics over time. We're also using the idea of LLM as a judge to evaluate the non-deterministic characteristics of our agent. So personality, and tone, and style, and output quality. We're using a non-deterministic grader as a part of our eval to evaluate our agent's non-deterministic characteristics. Now, we're going to run the evals for our agent in just a bit, but when you do, you'll find that the agent is struggling a bit. I'll talk about some of these evals in just a little bit more depth. So F1 on the screen, third from the bottom, this is essentially simulating a daily low stock sweep. So again, this is an inventory agent. We're simulating our ability to look through all of our inventory and pull the low levels of stock. This eval you'll find will actually fail because the agent is going to do the right thing, but it's going to take a very winding path to do so. So instead of taking the straightest line from point A to point B, the agent is going to take a very inefficient path. It's going to get to the right end, but it's going to fail the eval because it's not at the efficiency that we'd like. F2 on the screen is another eval that you'll see fail. This eval actually evaluates the ordering process under a particular promotion package. This is going to fail because we are using a sub-agent for this particular task. The sub-agent is actually getting the task right, but there's a communication breakdown between our sub-agent and our orchestrator. This is a really common point of failure that we see when customers have really complicated systems with a lot of sub-agents. It's important to get the communication between your sub-agents and your orchestrator just right. In the case of F2, like you see on the screen, this is an eval that's going to fail because we have a breakdown in that communication. The last one that I'll highlight that you'll see fails is R8 on the screen. R8 will essentially check the forecasting during a particular promotion month. This eval is also going to fail because we have two different policies that live in very different parts of our system prompt and actually end up contradicting each other. I mentioned over time our system prompt has grown. We start to have some conflicts and the model gets confused, leading towards a failure for this particular eval. In the repo, you'll see it in the README, when we run these evals, you'll see that they're going to pass up front at about 83%, which is okay, but if you work in the world of manufacturing, that is not okay. 17% failure is a really expensive failure percentage. Now let's double click on R8 again, just so that we can understand a little bit about what's happening behind the scenes. Again, R8 is where we're essentially calculating the forecast during a particular month with a promotion. On my screen here, on the right side, where you see kind of the simulated terminal window, within the first block under the commented text, we can see that the agent pulled the right forecasting baseline and also pulled the right promotion multiplier. Forecasting baseline, 12 units a day, promotion multiplier, 3.1x. This is all correct, but in the calculation part below that, we can see that there was actually some kind of hallucination that happened. Instead of using that 3.1x promo multiplier, the agent actually ended up using 1.35. So something happened along the way. A hint here is that the reason for this is that we have context problems. So this isn't a model problem. It's an issue with the information that we're surrounding the model with. Our system prompt has grown to be really long and is very confusing for the model and has some conflicts in it, which lead to the issue that shows up within this eval. So folks, our objective in this workshop will first be to run our suite of evals. We're going to triage the issues and we're going to update the design of our agent accordingly. And then we're going to do something that we call internally hill climbing towards eval improvement. So we run our evals, we get a baseline, it's going to be about 83%. We're then going to optimize the architecture of our agent and we're going to continue then running our eval so that we climb on them, hopefully seeing the success percentage improve over time. In this lab, we're also going to start with an agent that is self-created on our messages API. Again, if you have the repo and you click on the before folder, I'll show you this in just a bit. This is an agent that is built from scratch on our messages API. We're going to actually migrate that agent to cloud managed agents. Cloud managed agents essentially allows us to offload the messiness that comes with maintaining an agentic harness and scaling agents safely and securely to thousands and tens of thousands of users, right? Like if I want to build my agent locally and run it locally, I can do that pretty quickly and pretty easily. But the moment that I need to take that agent, I need to host it remotely and I need to allow hundreds and thousands of users to at the same time engage with that agent. There's an infrastructure problem. There's a scaling problem. There's memory, there's security. There's so much that I have to account for. So in order to offload that, so I can just worry about the architecture of my agent itself and make decisions around tools, skills, and sub-agents, I'm going to offload everything else to cloud managed agents. So again, to break that down just a bit, there's been a few talks on CMA so far today, but this is really where we're able to separate the agent from the session details from the sandboxed environment where tool calls are actually happening. Again, this allows us to offload particular parts of the stack to then only worry about the design of our agent itself. All right. I mentioned that we're going to get hands-on in this workshop. We are going to go ahead and do that right now. Now what you see on the screen here is the workshop URL as well, if you haven't had a chance to grab it, feel free to go ahead and do so. This is where we're keeping all of the different workshops throughout Code with Cloud within London so you can go back and revisit them if helpful. Within this workshop, we're going to be working on agent decomposition, so that's going to be the name of the folder that we're actually going to be working within. Great. Let me jump forward here. Perfect. The first thing that we're going to do as a part of this workshop is we're first going to get a baseline. When you open up that link, you'll first clone the repo. We're going to clone the repo locally. We have a UV project that's set up, so we're going to run UV sync in order to make sure that we have all of our packages and our dependencies to be able to invoke the Anthropic SDK and then eventually deploy our agent to Cloud-managed agents. We can run UV sync to do that. I mentioned previously that we're going to need an API key for this workshop as well. Using those credits that you got at the start of this session, you can go to your Cloud console account and create an API key. If you copy the ENV example, you'll just have to manually copy your API key into the ENV file that's created for you. All the 12 evals that I previously walked you through, we have all of those set up already. In order to get a baseline and run those evals, you have to run uv run evals dash dash agent before. This is all in the README, but if you just run that command, you will be able to actually go about running your evals. In terms of our building here, we're going to take a number of steps to actually go about running our evals, using Cloud Code to triage the results of them, and then climbing accordingly on our agent. We're first going to take a look at the system prompt that we have for our agent itself. I mentioned earlier that our system prompt is currently sitting at about 400 lines long. We've been stacking information on our system prompt over and over again as we've continued to get more business requirements. Our system prompt is very long. We'll take a look at that. We are then going to take some time to evaluate the tools that we're using. Right now, as I mentioned, we have 12 different tools. Three of them are actually wrapped sub-agents, so we'll take a look to see what we can do to make that more efficient. Then lastly, if there are any sub-agents that we really need to make our agent effective, we're going to take a look at the best way to actually construct sub-agents with Cloud-managed agents. I'm going to jump back just for a moment. There's one thing that I forgot to mention for you as you get started. Within the repo folder, there's two different folders that you'll see. There's a before folder, and then there's a starter folder. Those contain two separate agents. If you want to view the messages API version of the agent, again, this is just me building my own agent loop and my own agent harness around the Anthropic Messages API to invoke Cloud. You'll see that within the before folder. If you want to view what that agent looks like when deployed on Cloud-managed agents, you can look in the starter folder, which exists right below that. If you want to deploy your agent on Cloud-managed agents, you can run uv run deploy starter. Again, run your evals using the messages API version, dash dash agent before. You can then deploy your agent on Cloud-managed agents. We already had it built for you. It's really easy to use Cloud Code to compare the two and understand exactly what's going on and what some of the differences are with Cloud-managed agents. I'm going to jump over here, and we're just going to open up Cloud Code, and we are going to build together. I'm going to zoom in very far so that you can see everything and so that I can see everything. We'll just talk through exactly what happens when I run some of these evals, and we'll talk through the process that we usually go through to do what I just called hill climbing on the evals themselves. If you're looking at Cloud Code here, again, I just used Cloud Code to actually run my evals because I want Cloud's help in triaging what's going on. This is me. I'm using Cloud Code. I have Opus 4.7 running, as you can see on the screen. My effort level is set to extra high. I usually set effort as extra high with Opus 4.7, and I forget about it. That's the effort level that I usually stay on. We find that it gets great performance with extra high effort altogether. Now, you can see on the screen, the first thing that I did was I ran my eval. I used the bash capability in Cloud Code, and I ran uvrun evals dash dash agent before. Cloud actually went ahead and ran my eval. I'm going to scroll down, and we're going to look at what Cloud found while actually running those. You can see the response that we got, the results that we got from this eval run was actually lower than what I told you before. We ran them, and we got 62%, which is worse than the 83% that we started with. We passed seven out of 12 of them, and it looks like Cloud has provided us with a diagnosis for the different evals that we actually failed. Let's scroll down just a bit more. We are going to use Cloud to understand a little bit more about why this actually happened. You can see here, I am using Cloud to provide me some of the themes around why we actually failed some of these evals. Again, this is a great technique. If you have evals for your agent, again, as Giri showed before this session, you can use Cloud to actually go about triaging these. It looks like there's a few different themes that Cloud is figuring out based on this agent. The first thing, Cloud is seeing that our model is taking on a lot of work that it should have tools in order to do. Our model is doing a lot of reasoning across information that it just doesn't have the tools to be able to complete. It looks like there is some issues that we have with the enforcement of output structure. Our model and our sub-agents are producing information in a particular output structure that doesn't align exactly with what we're looking for to pull the best performance from our agent. If I continue to scroll down here, you can see there was some policy issues, et cetera. As I mentioned before, we have a system prompt that's really long. Cloud is seeing some confusions based on the information that's found within the system prompt. Again, you can see Cloud has found some root causes. Now, we're going to do a few different things here. Again, we're going to go one by one and address some of the areas that we're seeing issues on within our agent. I'm going to scroll down here, and we are going to use Cloud Code to triage some things within our agent. The first thing that I'm going to ask Cloud to do, we're going to talk through this. Cloud is making some changes, which is great. System prompts tend to get very, very long when we accumulate agents over time. The first prompt that I ran, if you're following along, feel free to go ahead and do this. I encouraged Cloud to look at my agent.py file, which is where our main CMA agent loop is located. Again, that's agent.py. I essentially said, hey, Cloud, do you have any thoughts on the system prompt? Maybe I can use skills instead of a long-running system prompt for progressive disclosure. The first thing that we'll talk about is skills. There's been a few other sessions on skills. The short definition that I like to use is that skills are packaged in composable information that Cloud has the ability to pull into context whenever Cloud realizes that it needs that information to complete a particular task. Skills are really useful with Cloud Code. If you need to provide Cloud information on your testing process, or if you want to package up your brand and your UI components and bundle them into a skill that Cloud can pull into context whenever needed, skills are fantastic. Skills are also useful within the agents that you're building for your customers. If you're building a product and you are going to give that product to customers, you're building an agent, skills are great within that. In the case of the agent that we have on the screen here, again, we have a lot of different policies and a lot of procedures that go into our inventory management system. As I accumulated requirements over time, instead of building skills, I decided to take all of that information and keep appending it to my system prompt. My system prompt got longer and longer and longer over time. This is not something that we recommend you do based on the introduction of skills. Leave the system prompt only for the information that Cloud needs in its mind, regardless of the task that you give it. Skills are fantastic for packaging information that Cloud is going to need some of the time, not all of the time. If I ask Cloud to go build a forecast, Cloud is going to go ahead and do that. Let's see, I lost my computer just for a second. Here we go. If I ask Cloud to go ahead and build a forecast, Cloud is not going to need forecasting information unless I specifically ask it to go ahead and build that forecast. In the case of that particular task, I want Cloud to pull forecasting information into its context window. Skills are also fantastic for making sure that you are being efficient with context because if you stuff all of this information into the system prompt, you're polluting that context window with information that Cloud does not need in order to complete a particular task. Again, the first thing that I did, I'll zoom in just a bit more so that you can see this and I'll scroll up just a bit. I said, hey, Cloud, can you help me take a look through my system prompt? Can I use skills instead? My system prompt is too long and I need some help. Cloud did an analysis of this and realized that I have some pre-built skills that I can use to supplement information in my system prompt. The first correction or fix that we're going to make to modernize our architecture here is we are actually going to remove much of the system prompt and we're going to put that information into skills. You can see here the first thing that we're doing with Cloud is we are activating a number of different skills that previously were not there before. We're actually swapping our system prompt to be a short prompt instead of a long one. If you're curious, if you feel like you have a long system prompt within the agents that you're building, feel free to take a look at this to see the differences between what was like a 400 line system prompt compared to about a 50 line system prompt. We've supplemented that and we've switched a lot of that information to skills. Great. I am now going to continue working with Cloud. You can see we made those changes here, which is fantastic. There's some evals that I can go rerun. I'm going to ask Cloud to do one more thing and then we're going to rerun some of our evals to see where we've improved. I mentioned before that we have 12 different tools. You saw those on the screen in the second slide that I shared. As a part of this inventory management agent, we have tools that we've created for everything. Whenever Cloud needs to retrieve data, we have a tool. Whenever Cloud needs to analyze data, we have a tool for that. We have tools for everything. I'm going to ask Cloud to take a look at the tools that my agent has and help me think through how I can optimize here. Right now, Cloud is running an analysis across the different tools that I have for my agent. We're going to get to see what some of the results for. While this is working, I'll give you a tip when it comes to building agents that we carry with us at Anthropic for our agents internally and the agents built with customers. Whenever we build agents, we lean into the same primitives that we as humans have access to. Imagine yourself when you show up to work. You have a computer that's sitting in front of you. You have the ability to navigate files on a file system. You can type in the browser and you can search the web. If you're an engineer, you have the ability to write and execute code. When you think about Cloud Code as an agent, we've effectively given Cloud access to all of the same primitives that you and I have access to when we show up to work every single day. Cloud Code is a great coding agent because Cloud is really good at code, but essentially what we've done with Cloud Code is we've just given Cloud access to a computer. This is really powerful because this allows us to drop in better versions of Cloud as we continue to release new models, and Cloud just uses those primitives better than it did before. Imagine yourself after this conference compared to yourself when you walked in. You're going to have the same tools at your fingertips, but theoretically your brain's going to be a little bit bigger. You're going to be smarter based on what you learned here, and you're going to be more effective while using the same tools. Cloud works the same exact way. Whenever we build agents, we lean into human-like primitives first. These primitives are things like code execution and the navigation of a file system, the keeping of a to-do list, the ability to search the web. These are foundational tools that we always start with when we build agents, and we remove them as needed. An example that I like to give is with file document analysis. If you're building an agent that requires document analysis, maybe you have a lot of CSVs or Excel sheets that your agent is going to be looking over, code execution, so the ability to write and run code, is one of the best ways of doing data analysis and working across lots of documents. If you need Cloud to look across a CSV, give Cloud a Bash tool so that Cloud can write a quick Python script and reason across the results after running that Python script is much more effective than just uploading the entire CSV into Cloud's context window. We lean into these computer-like primitives first when building an agent. If I scroll down here, that's exactly what we did here. You can see we took a lot of steps, and we actually removed most of the tools that exist within our agent, and we replaced them with some of the primitives that I talked through previously. This is an inventory management agent that leans really well to this. I have the ability to consolidate and remove a lot of the tools that I'm using to reason across Excels and reason across forecasting data, and just give Cloud access to the same tools that Cloud Code has in order to do that. What's cool about this is that when you build using Cloud-managed agents, these tools are actually included by default. If you want to give Cloud access to those same tools that Cloud Code has and use them to drive powerful capability within your agent, you don't have to worry about writing a tool that gives Cloud the ability to write and run code. You don't have to write a tool that gives Cloud the ability to use the file system. You can just rely on those built-in tools that we have built ourselves for Cloud Code that we just make available through Cloud-managed agents. I'm going to ask Cloud to rerun an eval, the evals to see if we are getting better. Now, with your agent, there's always going to be the need to add some custom tools as well. You're only going to get so far.

Summary

The workshop discusses managing complexity in AI agents, focusing on performance degradation due to bloated system prompts. It emphasizes the importance of using tools and skills effectively, and the benefits of cloud-managed agents for scalability. Participants will run evaluations to identify and address performance issues.

Key Points

Complex agents can degrade in performance over time.
Evaluate when to use tools, skills, or sub-agents.
Long system prompts can confuse agents and lead to failures.
Use skills to manage information instead of bloating system prompts.
Cloud-managed agents simplify scaling and maintenance.
Regularly run evals to assess agent performance and issues.

Tags

ai-agents cloud-management performance-evaluation inventory-management skills-and-tools

Repurpose Ideas

LinkedIn post: Key takeaways on AI agent management.
Tweet: Tips for optimizing AI agent performance.
Checklist: Steps to modernize your AI agent architecture.

Save videos. Search everything.

Build your personal library of inspiration. Find any quote, hook, or idea in seconds.

Create Free Account No credit card required

Original