Vibe Coding an AI Agent
A hands-on guide for product managers, product owners, designers, analysts, and anyone else who wants to understand AI agents by building one — without learning a programming language first.
Inspired by and adapted from Hendrixer/agents-v2 and the AI Agents v2 course on Frontend Masters by Scott Moss. The original course builds the agent in TypeScript; this edition reimagines the same architecture as a series of prompts you give to a coding agent.
Who This Book Is For
If you’ve ever:
- Sat in a meeting where engineers debated “tool calling vs. function calling” and felt lost,
- Read a blog post about AI agents and wanted to actually try one but stopped at “open your terminal,”
- Built a Notion doc full of agent ideas but had no way to validate them,
- Wanted to understand what your engineering team is shipping when they say “we built an agent,”
…this book is for you.
You don’t need to know Python. You don’t need to have written a line of code. You don’t need to understand what an API is (yet).
You do need:
- A computer (Mac, Windows, or Linux)
- A credit card to pay for one OpenAI API key (~$2 will cover everything in this book)
- A coding agent installed — we recommend Claude Code, but Cursor or GitHub Copilot Workspace will also work
- About 4–6 hours of focused time
That’s it. The coding agent writes the code. You drive.
What You’ll Build
By the end of this book, you’ll have a working CLI AI agent on your laptop that can:
- Read, write, and edit files on your computer
- Run shell commands
- Search the web
- Manage long conversations
- Ask for your permission before doing anything dangerous
It’s the same agent the Python edition builds — same language, same architecture. But instead of typing every line yourself, you’ll guide a coding agent through the build, one prompt at a time.
Why This Approach?
Three reasons.
1. You learn by building. Reading about agents is one thing. Watching code appear, running it, breaking it, and fixing it is something else entirely. The understanding sticks.
2. The coding agent is the future of software work. Whether or not you ever write code yourself, the people on your team will increasingly work with coding agents. Knowing what a good prompt looks like, how to verify output, and when to push back on the agent are core skills now — even for non-engineers.
3. The agent you build is real. It’s not a simulator or a toy. It’s the same architecture used by Claude Code, Cursor, and the agents your engineering team is shipping. By the end you’ll be able to look at any agent product and have an informed opinion on what’s hard, what’s easy, and what’s just hype.
How This Book Is Different
Each chapter follows the same seven-section format:
- What you’re building and why — The concept, in plain language. No jargon.
- The prompt — A copy-pasteable prompt for your coding agent.
- What you should see — Concrete expectations: which files appear, what they roughly contain.
- How to verify — One command to run, with the expected output.
- If it didn’t work — Three to five common failure modes and recovery prompts.
- Reference code — The canonical version (collapsed). Compare against it if you want.
- What you just learned about agents — The takeaway, in product-manager terms.
You don’t need to read the reference code. It’s there if your coding agent produces something weird and you want to see what should have happened.
A Note on Coding Agents and Drift
Coding agents are non-deterministic. Two readers running the same prompt may get slightly different code. That’s fine — what matters is that the behavior matches what we describe in “How to verify.”
If your agent’s output diverges from the reference code in surface ways (different variable names, different file structure) but the verification step still passes, you’re done. Move on.
If the verification fails, the “If it didn’t work” section will get you unstuck most of the time. If you’re still stuck after that, the reference code is your safety net.
Tech Stack
You don’t need to know any of this. Your coding agent does. It’s listed here so you recognize the words when they appear:
- Python 3.11+ — The language the agent is written in
- OpenAI SDK — How we talk to the LLM
- Pydantic — How we describe what tools take as input
- Rich + Prompt Toolkit — How we make the terminal look nice
Acknowledgments
This edition is the same agent as the Python, TypeScript, Rust, Go, and Java editions — just built through prompts instead of code. If at any point you want to see the “answer key,” the Python edition has the full hand-written walkthrough.
Ready? Let’s set up your coding agent.
Next: Chapter 0: Setting Up Your Coding Agent →
Chapter 0: Setting Up Your Coding Agent
This is the only chapter where you’ll do “setup” work. Once your coding agent is running and your API keys are in place, every other chapter is just: paste a prompt, watch it work, run one verification command.
If you get stuck here, that’s normal. Most people get stuck on environment setup at least once. The good news: you only have to do it once.
What You’re Building and Why
You need three things on your computer before Chapter 1:
- A coding agent — the AI that will write Python code on your behalf
- An OpenAI API key — so the agent you build can talk to a model
- Python 3.11 or newer — the language the agent will be written in
Think of it like setting up a new kitchen before cooking. We’re laying out the tools so the actual cooking (Chapters 1–3) can be uninterrupted.
Step 1: Pick a Coding Agent
We recommend Claude Code for this book. It’s the most capable terminal-based coding agent at the time of writing, it handles multi-file projects well, and the prompts in this book have been tested against it.
Other coding agents that will work:
| Coding Agent | Works for this book? | Notes |
|---|---|---|
| Claude Code | Yes (recommended) | Best fit for the prompt style we use |
| Cursor | Yes | IDE-based; you’ll paste prompts into the chat panel |
| GitHub Copilot Workspace / Codex | Yes | Similar workflow to Cursor |
| ChatGPT + manual copy/paste | Works but tedious | You’ll be the file system |
The rest of this chapter assumes Claude Code. If you’re using a different agent, the prompts are the same — just paste them into wherever your agent takes input.
Install Claude Code
Open your Terminal app (on Mac: ⌘+Space, type “Terminal”; on Windows: search “PowerShell”), and paste:
curl -fsSL https://claude.ai/install.sh | sh
Then run:
claude --version
You should see a version number. If you get “command not found,” close and reopen your terminal and try again.
For the latest install instructions, see Claude Code docs.
Sign in
Run:
claude
The first time you run it, it’ll walk you through signing in. Use your Anthropic account — if you don’t have one, create it at claude.ai. Claude Code uses your Anthropic subscription or pay-as-you-go credits; it does not use your OpenAI key (that’s for the agent you’re building, not the agent that’s helping you).
Step 2: Get an OpenAI API Key
The agent you build in this book talks to OpenAI’s models. You need one API key.
- Go to platform.openai.com
- Sign up or log in
- Click your profile icon → API keys → Create new secret key
- Copy the key. It starts with sk- and is about 50 characters long.
- Paste it somewhere safe for now (a sticky note, password manager, anywhere you can find it again in five minutes)
You’ll also need to add a few dollars of credit to your OpenAI account: Settings → Billing → Add payment method. The entire book uses well under $5 of credit.
Why both an Anthropic and OpenAI account? Claude Code (your helper) is made by Anthropic. The agent you’re building uses OpenAI models because that’s what the original course uses. There’s no technical reason — you could rebuild this whole book using Claude or Gemini models with small prompt tweaks. We’re staying on OpenAI to match the other editions exactly.
Step 3: Make Sure Python Is Installed
In your terminal:
python3 --version
If you see Python 3.11.x or higher, you’re done with this step.
If you see something lower (like Python 3.9) or “command not found”:
- Mac: Install Homebrew, then run brew install python@3.12
- Windows: Install from python.org/downloads — make sure to check “Add Python to PATH” during install
- Linux: sudo apt install python3.12 (or your distro’s equivalent)
Then run python3 --version again to confirm.
Step 4: Create Your Project Folder
Pick a folder where you want this project to live. Anywhere works — your Desktop, your Documents folder, wherever. In your terminal:
mkdir agents-v2
cd agents-v2
This creates an empty folder and moves into it. From here on, every prompt assumes you’re inside this folder.
Step 5: Start Claude Code in Your Project
From inside the agents-v2 folder, run:
claude
Claude Code will start up and show you a prompt. This is where you’ll paste every prompt in this book.
Try a smoke test. Paste this:
What folder are we in, and is it empty?
Claude Code should respond with something like “We’re in /Users/.../agents-v2 and the folder is empty.” If it does — congratulations, your coding agent is working.
How to Verify Everything Is Ready
Run these four commands in your terminal (outside Claude Code) and check the output:
claude --version # should print a version
python3 --version # should print 3.11 or higher
pwd # should end in /agents-v2
ls # should print nothing (empty folder)
If all four look right, you’re set.
If It Didn’t Work
“claude: command not found”
Close your terminal completely and reopen it. The install script adds claude to your shell’s PATH, and that change only takes effect in new terminal windows.
“python3: command not found”
On Windows, you may need python instead of python3. Try python --version. If that works, mentally substitute python for python3 in the rest of this book.
Claude Code keeps asking me to log in
You may have multiple Anthropic accounts. Run claude logout then claude and sign in again with the account that has credits.
My OpenAI account says “you must add a payment method”
You do — even if you have free trial credit, OpenAI requires a card on file before issuing API keys. Add one in Settings → Billing.
Chapter 1 fails immediately with “incorrect API key provided”
Your OpenAI key is wrong, or there’s a stray space when you pasted it. Re-copy from platform.openai.com and try again. The key should start with sk- and be one continuous string with no spaces.
My coding agent does something completely different from what the prompt asks
This happens occasionally. Try the prompt again in a fresh Claude Code session (/clear inside Claude Code, or quit and restart). If it still misbehaves, it usually means the prompt was ambiguous in your context — re-read the “What you should see” section and tell the agent more specifically what you wanted.
What You Just Learned About Agents
Two things, actually.
First: agents need API keys, money, and an environment. Every “magical” AI product you’ve used had someone do this exact setup. Knowing where the keys come from, what they cost, and where they live demystifies a huge part of how AI products are deployed.
Second: you just used a coding agent to verify your environment. When you asked Claude Code “what folder are we in, is it empty?”, it ran shell commands, read the output, and summarized them for you. That’s exactly the loop you’re going to build in Chapter 4: an LLM that can call tools, see the results, and respond. You’ve been the agent’s user. Soon you’ll have built one of your own.
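That loop — call tools, see results, respond — is worth seeing in miniature before you build the real thing. Here is a toy sketch with a stubbed model, so it runs without an API key; the names fake_model, run_tool, and agent_loop are invented for illustration, and a real agent swaps the stub for an actual LLM call:

```python
# A minimal sketch of the tool-use loop you'll build in Chapter 4.
# The "model" here is a stub so the sketch runs offline; a real agent
# replaces it with an LLM call that returns either text or a tool request.

def fake_model(messages):
    """Stub LLM: asks for a tool on the first turn, answers on the second."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "list_files", "args": {"directory": "."}}
    return {"text": "The folder contains one file: notes.txt"}

def run_tool(name, args):
    # Your code -- not the model -- actually executes the tool.
    if name == "list_files":
        return "[file] notes.txt"
    return f"Unknown tool: {name}"

def agent_loop(user_message):
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = fake_model(messages)
        if "tool" in reply:                       # model wants to act
            result = run_tool(reply["tool"], reply["args"])
            messages.append({"role": "tool", "content": result})
        else:                                     # model is done talking
            return reply["text"]

print(agent_loop("What files are in this folder?"))
```

The shape is the whole point: the model proposes, your code executes, the result goes back into the conversation, and the loop continues until the model answers in plain text.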
Next: Chapter 1: Your First LLM Call →
Chapter 1: Your First LLM Call
The smallest possible “AI program” is one that sends a question to a language model and prints the answer. No tools, no loop, no agent — just a single round-trip.
That’s what you’re going to build in this chapter, by handing a single prompt to your coding agent.
What You’re Building and Why
Three files:
- A project config (pyproject.toml) — declares the project exists
- A dependencies list (requirements.txt) — says which Python libraries to install
- A main script (src/main.py) — sends one question to OpenAI and prints the response
Plus a .env file holding your OpenAI key.
Why bother with this if it’s “just” one LLM call? Because every agent in the world starts here. The agent loop, tool calling, evals, and the terminal UI you’ll add in later chapters are all wrappers around this one primitive: ask a model, get an answer back. Get this working and the rest of the book is incremental.
Concept to understand before you read the prompt: an LLM is a paid web service. Your code sends an HTTP request to OpenAI’s servers with your question, OpenAI’s servers run the model, and they send back the answer along with how many tokens it used. The OpenAI Python SDK is just a polite wrapper around that HTTP request.
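To make that concrete, here is roughly what the SDK assembles on your behalf. This is a sketch of the request payload only — don’t send it as-is; it needs a real key, and the endpoint URL in the comment is where the SDK posts it:

```python
# What the SDK does under the hood: build a JSON payload and POST it
# over HTTPS. This sketch only constructs the request; it doesn't send it.
import json

payload = {
    "model": "gpt-5-mini",
    "messages": [
        {"role": "user", "content": "What is an AI agent in one sentence?"}
    ],
}
headers = {
    "Authorization": "Bearer sk-...your-key...",  # your API key rides in a header
    "Content-Type": "application/json",
}

# The SDK POSTs this JSON to https://api.openai.com/v1/chat/completions,
# then parses the JSON reply into response.choices[0].message.content.
print(json.dumps(payload, indent=2))
```

Every line of “AI code” you write in this chapter is a convenience wrapper around exactly this request and its response.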
The Prompt
Open Claude Code (or your coding agent) inside your agents-v2 folder. Paste this prompt as one block:
I'm building a Python CLI AI agent from scratch over the course of a book.
Please set up the absolute minimum project so I can make one LLM call to OpenAI.
Requirements:
1. Create a virtual environment using `python3 -m venv .venv` and explain how to activate it on macOS/Linux and Windows.
2. Create requirements.txt with these exact pinned versions or higher:
- openai>=1.82.0
- pydantic>=2.11.0
- rich>=14.0.0
- prompt-toolkit>=3.0.50
- python-dotenv>=1.1.0
3. Create pyproject.toml declaring a package called "agi" version 1.0.0 requiring Python 3.11+.
4. Create a .gitignore that ignores .venv, __pycache__, .env, and *.pyc.
5. Create a .env file with a single line: OPENAI_API_KEY=replace-me
6. Create src/__init__.py (empty) and src/main.py.
src/main.py should:
- Load environment variables from .env using python-dotenv
- Create an OpenAI client (it picks up OPENAI_API_KEY automatically)
- Call client.chat.completions.create with model "gpt-5-mini"
- Send a single user message: "What is an AI agent in one sentence?"
- Print response.choices[0].message.content
7. After creating the files, install the dependencies into the venv with pip.
8. Tell me the exact command to run the script, and what I should see.
Do not add a system prompt yet. Do not add tools yet. Do not add streaming yet.
Keep main.py under 20 lines. I want this to be as small as possible.
Hit enter and let it run. The agent will create files, run pip install, and tell you the command to test it.
What You Should See
When the agent finishes, your folder should look roughly like this:
agents-v2/
├── .env
├── .gitignore
├── .venv/
├── pyproject.toml
├── requirements.txt
└── src/
├── __init__.py
└── main.py
src/main.py should be a tiny file — about 12–15 lines. It should import os, dotenv, and openai, call load_dotenv(), instantiate an OpenAI client, send one chat completion request, and print the result.
The agent should also have run pip install -r requirements.txt and output a confirmation that the packages are installed.
How to Verify
First, replace replace-me in .env with your actual OpenAI key:
# Open .env in any editor and change:
# OPENAI_API_KEY=replace-me
# to:
# OPENAI_API_KEY=sk-...your-actual-key...
Then activate your virtual environment (the agent should have told you how — typically source .venv/bin/activate on Mac/Linux or .venv\Scripts\activate on Windows) and run:
python -m src.main
You should see something like:
An AI agent is an autonomous system that perceives its environment, makes decisions, and takes actions to achieve specific goals.
The exact wording will be different every time you run it — that’s expected. LLMs are non-deterministic. As long as you get one English sentence describing an AI agent, you’re done.
If It Didn’t Work
ModuleNotFoundError: No module named 'openai'
You’re not in your virtual environment. Run source .venv/bin/activate (Mac/Linux) or .venv\Scripts\activate (Windows) and try again. Your terminal prompt should show (.venv) at the start.
openai.AuthenticationError: Incorrect API key provided
Your OpenAI key in .env is wrong, missing, or has whitespace around it. Re-copy from platform.openai.com/api-keys, make sure it starts with sk-, and there are no quotes or spaces.
openai.RateLimitError: You exceeded your current quota
You haven’t added a payment method or you’re out of credit. Go to platform.openai.com/account/billing.
Model 'gpt-5-mini' does not exist or you do not have access to it
This model name is what the rest of the book uses. If your OpenAI account doesn’t have access yet, ask your coding agent: “Change the model in src/main.py from gpt-5-mini to gpt-4o-mini.” Everything in the book will still work.
The agent created a much bigger project than I asked for
Tell it: “This is too much. Delete everything except .env, .gitignore, requirements.txt, pyproject.toml, src/__init__.py, and src/main.py. Keep main.py under 20 lines.” Coding agents sometimes “help” by adding logging, error handling, or class wrappers. For learning, smaller is better.
Reference Code
If you want to see what src/main.py should look like, here’s the canonical version from the Python edition:
src/main.py (click to expand)
import os
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI()
response = client.chat.completions.create(
model="gpt-5-mini",
messages=[
{"role": "user", "content": "What is an AI agent in one sentence?"}
],
)
print(response.choices[0].message.content)
Your version may look slightly different — different variable names, an extra blank line here or there. As long as the verification step works, you’re fine.
What You Just Learned About Agents
You just shipped the simplest possible LLM application: input → model → output. No memory, no tools, no loop. It cost a fraction of a cent and took one round-trip to OpenAI’s servers.
Three things to internalize from this:
1. The model is a function. Despite all the hype, an LLM API call is conceptually output = model(input). Everything else — tools, streaming, agents, RAG — is what you build around that function. When someone says “we built an AI feature,” 80% of the time they built something with this one primitive at the center.
2. Determinism is gone. You ran the same code twice and got two different answers. Every product decision around AI has to account for this. Tests, evals, and user-facing copy all have to assume the model will sometimes say something different.
3. You paid for that. Look at your OpenAI usage dashboard. That call cost a fraction of a cent. Now imagine 10 million users hitting it. Cost is a first-class product concern with LLM features in a way it isn’t with traditional software, where compute is essentially free.
In Chapter 2, you’ll teach this same primitive to use tools — and that’s where it stops being a chatbot and starts being an agent.
Next: Chapter 2: Tool Calling →
Chapter 2: Tool Calling
This is the chapter where your program stops being a chatbot and starts being an agent.
What You’re Building and Why
In Chapter 1, you sent a question and got an answer. The model couldn’t do anything — it could only talk about doing things.
In this chapter, you’ll teach the model two tools:
- read_file(path) — read the contents of a file
- list_files(directory) — list what’s in a folder
You won’t write a loop yet, and the model won’t actually call the tools and see the results — that’s Chapter 4. What you will do is hand the model a list of available tools and watch it choose the right one.
The single most important concept in this chapter: the LLM does not run your tools. It outputs structured JSON saying “please call list_files with directory .”. Your code reads that JSON and decides whether to actually run anything. The LLM is the brain. Your code is the hands. This separation is what makes agents safe to operate — your code is always the gatekeeper.
By the end of this chapter, when you ask the model “what files are in this folder?”, instead of saying “I can’t see your files,” it’ll output a tool call. That’s the moment it stops being a chatbot.
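What does that tool call actually look like to your code? A hand-written sketch, using a plain dict that mirrors the shape of an entry in message.tool_calls (the real SDK returns objects with attributes, e.g. tool_call.function.arguments, rather than dicts):

```python
# The gatekeeper step in miniature: the model's "request" is just JSON,
# and your code decides what (if anything) to run. The values below are
# hand-written to mirror the shape of an OpenAI tool call.
import json

tool_call = {
    "function": {
        "name": "list_files",
        "arguments": '{"directory": "."}',  # arrives as a JSON *string*
    }
}

name = tool_call["function"]["name"]
args = json.loads(tool_call["function"]["arguments"])  # parse before use

# Your code is the gatekeeper: run it, refuse it, or ask the user first.
if name == "list_files":
    print(f"Model wants to list: {args['directory']}")
```

Note the arguments field is a string of JSON, not parsed JSON — a common source of first-time bugs, and why the prompt below asks for “parsed arguments” explicitly.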
The Prompt
In the same Claude Code session, paste this:
Continuing the agent build. I want to add tool calling — defining two file
tools and showing they get selected by the LLM. We are NOT building the agent
loop yet. We just want to see the LLM pick a tool.
Please make these changes:
1. Create src/agent/system/__init__.py (empty) and src/agent/system/prompt.py
containing a SYSTEM_PROMPT constant. Make it short — about 5 lines —
describing a helpful AI assistant that's direct, honest, and stays focused.
2. Create src/agent/__init__.py (empty), src/agent/tools/__init__.py, and
src/agent/tools/file.py.
3. In src/agent/tools/file.py define:
- read_file_execute(args: dict) -> str that opens args["path"] and
returns its contents. On FileNotFoundError return a clear error string.
On any other Exception return "Error reading file: <message>".
- list_files_execute(args: dict) -> str that lists args.get("directory", ".")
and returns one entry per line. Prefix directories with "[dir] " and
files with "[file] ". Sort alphabetically. Handle FileNotFoundError and
generic Exception with clear strings.
- READ_FILE_TOOL — an OpenAI-format tool definition with type "function",
name "read_file", a description that tells the LLM exactly when to use
it, and a single required string parameter "path".
- LIST_FILES_TOOL — same shape, name "list_files", a single string
parameter "directory" with default "." and NOT required.
4. In src/agent/tools/__init__.py expose:
- ALL_TOOLS = [READ_FILE_TOOL, LIST_FILES_TOOL]
- TOOL_EXECUTORS = {"read_file": read_file_execute,
"list_files": list_files_execute}
5. Create src/agent/execute_tool.py with one function:
execute_tool(name: str, args: dict) -> str
that looks up the executor in TOOL_EXECUTORS, runs it, and returns the
string. If the tool name is unknown, return "Unknown tool: <name>".
Catch exceptions during execution and return "Error executing <name>: <e>".
6. Update src/main.py so it:
- Loads .env
- Imports SYSTEM_PROMPT and ALL_TOOLS
- Calls client.chat.completions.create with model "gpt-5-mini",
messages = [system, user], where the user message is
"What files are in the current directory?"
- Passes tools=ALL_TOOLS
- Prints message.content (which will likely be None)
- Prints any tool calls as JSON: name + parsed arguments
Do NOT add an agent loop. Do NOT execute the tool result. We just want to see
that the LLM responds with a tool call instead of text.
Important: tool descriptions matter a lot. Make them specific about WHAT the
tool does and WHEN to use it — not just "file tool".
What You Should See
After the agent finishes, your project should look like:
agents-v2/
├── .env
├── .gitignore
├── pyproject.toml
├── requirements.txt
├── src/
│ ├── __init__.py
│ ├── main.py
│ └── agent/
│ ├── __init__.py
│ ├── execute_tool.py
│ ├── system/
│ │ ├── __init__.py
│ │ └── prompt.py
│ └── tools/
│ ├── __init__.py
│ └── file.py
src/agent/tools/file.py should have two _execute functions and two _TOOL dictionaries. The dictionaries follow OpenAI’s tool format — you’ll see keys like type, function, name, description, parameters.
src/main.py should be a bit longer than Chapter 1 — maybe 25–30 lines now — because it imports the tools and prints both content and tool_calls.
How to Verify
Activate your venv and run:
python -m src.main
You should see something like:
Text: None
Tool calls: [
{
"name": "list_files",
"args": {
"directory": "."
}
}
]
The two things to check:
- Text is None (or empty) — the LLM did not respond with prose. It chose to call a tool instead.
- Tool calls has exactly one entry, and it’s list_files — the LLM picked the right tool for the question.
If both of those are true, tool calling is working. The LLM has understood that to answer “what files are in the current directory?” it needs to call a tool, and it has chosen the correct one of the two you offered.
Try asking it to read a specific file too. Edit the user message in src/main.py to:
{"role": "user", "content": "What does the file pyproject.toml contain?"}
…and run again. You should now see a read_file call with {"path": "pyproject.toml"}.
If It Didn’t Work
Text is not None — the LLM responded with prose like “I can’t access your files.”
The tool descriptions are probably weak. Tell your coding agent: “The LLM is responding with text instead of calling a tool. Make the descriptions in READ_FILE_TOOL and LIST_FILES_TOOL more specific about what they do and when to use them.” Good descriptions trigger tool use; vague ones don’t.
It calls the wrong tool (e.g., read_file when the question was about listing).
Same fix — sharper descriptions. Each description should make it obvious when not to use it. You can also tell the agent: “The LLM picks read_file when I ask about listing. Update the descriptions so they’re clearly distinct.”
KeyError: 'path' or similar when the agent tries to actually run the tool.
Don’t worry about this in this chapter. We’re not executing tools yet. You should only be printing the tool call, not running it. If your main.py is trying to execute the tool, tell the coding agent: “Don’t execute the tool. Just print the tool call. We’ll add execution in the agent loop chapter.”
It calls multiple tools at once.
Some models will call several tools in parallel. That’s fine and expected behavior — the API supports it. Just print all of them.
ImportError after the changes.
Your coding agent forgot to add an __init__.py somewhere. Run:
find src -type d
…and tell the agent: “There’s an ImportError. Make sure every folder under src/ has an __init__.py.”
Reference Code
src/agent/tools/file.py (click to expand)
import os
from typing import Any
def read_file_execute(args: dict[str, Any]) -> str:
file_path = args["path"]
try:
with open(file_path, "r", encoding="utf-8") as f:
return f.read()
except FileNotFoundError:
return f"Error: File not found: {file_path}"
except Exception as e:
return f"Error reading file: {e}"
def list_files_execute(args: dict[str, Any]) -> str:
directory = args.get("directory", ".")
try:
entries = os.listdir(directory)
items = []
for entry in sorted(entries):
full_path = os.path.join(directory, entry)
entry_type = "[dir]" if os.path.isdir(full_path) else "[file]"
items.append(f"{entry_type} {entry}")
return "\n".join(items) if items else f"Directory {directory} is empty"
except FileNotFoundError:
return f"Error: Directory not found: {directory}"
except Exception as e:
return f"Error listing directory: {e}"
READ_FILE_TOOL = {
"type": "function",
"function": {
"name": "read_file",
"description": "Read the contents of a file at the specified path. Use this to examine file contents.",
"parameters": {
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "The path to the file to read",
}
},
"required": ["path"],
},
},
}
LIST_FILES_TOOL = {
"type": "function",
"function": {
"name": "list_files",
"description": "List all files and directories in the specified directory path.",
"parameters": {
"type": "object",
"properties": {
"directory": {
"type": "string",
"description": "The directory path to list contents of",
"default": ".",
}
},
},
},
}
The full canonical version (with the registry and dispatcher) is in the Python edition Chapter 2.
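For the dispatcher itself (step 5 of the prompt), here is a minimal sketch — not the canonical version, and with the registry inlined for illustration rather than imported from src/agent/tools:

```python
# A sketch of execute_tool from step 5. In the real project, TOOL_EXECUTORS
# lives in src/agent/tools/__init__.py; it's inlined here so the sketch runs.

def read_file_execute(args):
    try:
        with open(args["path"], "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return f"Error: File not found: {args['path']}"

TOOL_EXECUTORS = {"read_file": read_file_execute}

def execute_tool(name: str, args: dict) -> str:
    executor = TOOL_EXECUTORS.get(name)
    if executor is None:
        return f"Unknown tool: {name}"       # never crash on a bad tool name
    try:
        return executor(args)
    except Exception as e:                   # tool failures go back as strings
        return f"Error executing {name}: {e}"

print(execute_tool("delete_everything", {}))
```

The design choice worth noticing: every path returns a string, even errors. The LLM can’t catch exceptions — it can only read text — so failures are serialized into the conversation instead of crashing the program.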
What You Just Learned About Agents
Three takeaways, in priority order.
1. The LLM is a planner. Your code is the executor. The most important architectural fact about agents is this separation. The LLM never touches your file system, your database, your API. It outputs structured JSON saying what it wants to do. Your code decides whether to actually do it. This is why you can build safe agents on top of unsafe-sounding capabilities like “run shell commands”: you control the execution layer, and you can refuse, modify, or audit any tool call before it runs. You’ll feel this directly in Chapter 9 when you add a “yes/no” approval prompt for dangerous operations.
2. Tool descriptions are product copy. The single most underrated skill in building agents is writing tool descriptions. They’re not documentation for humans — they’re prompts for the LLM that determine whether your tools get used at all. A tool with a vague description (“file utility”) will be ignored. A tool with a sharp description (“Read the contents of a file at the specified path. Use this to examine file contents.”) will be picked correctly. When your engineering team is building agents, ask to see the tool descriptions. They tell you more about reliability than the code.
3. The LLM can fail the task even when the call is perfect. Your model might pick the wrong tool, hallucinate a parameter, or call a tool when it should have just answered with text. There is no compiler that catches this. The only way to know if your agent reliably picks the right tool is to test it on lots of inputs. That’s what Chapter 3 is about — building an automated test harness for tool selection. It’s the most “engineering-y” chapter in the book, and it’s the one that separates demo agents from production agents.
Next: Chapter 3: Single-Turn Evaluations →
Chapter 3: Single-Turn Evaluations
This is the chapter that turns your project from “I built a thing once and it worked” into “I have a way to know if it still works tomorrow.”
If you remember one chapter from this book by name, make it this one. Evals are the single most underrated part of building AI products, and they’re the topic non-engineers most need to understand to be useful in agent conversations at work.
What You’re Building and Why
In Chapter 2, you saw the LLM correctly call list_files when you asked “what files are in the current directory?”. Great. But what about:
- “show me what’s in the project”
- “I want to read README.md”
- “what is the capital of France?” (should NOT call any tool)
- “tell me a joke” (should also NOT call any tool)
Will the LLM pick the right tool — or no tool — every single time? You don’t actually know. You ran one test. You’d need to type each prompt manually, eyeball the output, and remember whether it was right.
That doesn’t scale. Worse, every time you tweak a tool description, change the system prompt, or upgrade the model, all of your previous “yeah it worked” evidence becomes obsolete.
Evals are an automated test runner for LLM behavior. You write a list of test cases that look like:
prompt: "Read the contents of README.md"
expected: the LLM should call read_file
…and a script runs all of them and tells you what pass-rate you got. That’s the entire concept.
In this chapter you’ll build:
- A test dataset — a JSON file with prompts and expectations
- An executor — code that sends a prompt to the LLM and records which tool it picked (without actually running the tool)
- Evaluators — three small scoring functions for three test categories
- A runner script — prints pass/fail for each test and an overall score
The three test categories are:
| Category | Meaning | Example |
|---|---|---|
| Golden | The LLM MUST pick this exact tool | “Read README.md” → must pick read_file |
| Secondary | Ambiguous; LLM SHOULD pick a reasonable tool | “Show me the project” → probably list_files |
| Negative | The LLM MUST NOT pick any of these tools | “What’s 2+2?” → must not call any file tool |
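In code, the three evaluators reduce to a few lines each. This sketch simplifies the signatures to plain lists of tool names — the versions you’ll build in the prompt below take result and target objects instead:

```python
# Sketches of the three evaluators, one per test category. Signatures are
# simplified to plain lists of tool names for illustration.

def tools_selected(selected, expected):
    """Golden: 1.0 only if every expected tool was selected."""
    if not expected:
        return 1.0
    return 1.0 if all(t in selected for t in expected) else 0.0

def tools_avoided(selected, forbidden):
    """Negative: 1.0 only if no forbidden tool was selected."""
    if not forbidden:
        return 1.0
    return 1.0 if not any(t in selected for t in forbidden) else 0.0

def tool_selection_score(selected, expected):
    """Secondary: F1 -- harmonic mean of precision and recall."""
    if not selected or not expected:
        return 0.0
    hits = len(set(selected) & set(expected))
    precision = hits / len(selected)
    recall = hits / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Picked read_file AND an extra tool when only read_file was expected:
# precision 0.5, recall 1.0, F1 = 2/3 -- partial credit, not pass/fail.
print(tool_selection_score(["read_file", "list_files"], ["read_file"]))
```

Golden and negative cases are binary because there is a single right answer; secondary cases use F1 so an agent that picks a reasonable tool plus an unnecessary one gets partial credit instead of a flat zero.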
The Prompt
In Claude Code, paste this:
Continuing the agent build. I want to add single-turn evaluations for tool
selection. We are NOT executing tools or building an agent loop. We are just
checking which tool the LLM picks for each test prompt.
Please make these changes:
1. First, expose FILE_TOOLS in src/agent/tools/__init__.py:
FILE_TOOLS = [READ_FILE_TOOL, LIST_FILES_TOOL]
(Plus the existing ALL_TOOLS and TOOL_EXECUTORS — keep those.)
2. Create the evals package: evals/__init__.py (empty), evals/data/.
3. Create evals/types.py with three dataclasses:
- EvalData(prompt: str, tools: list[str], system_prompt: Optional[str] = None)
- EvalTarget(category: str, expected_tools: Optional[list[str]] = None,
forbidden_tools: Optional[list[str]] = None)
category is one of "golden", "secondary", "negative".
- SingleTurnResult(tool_calls: list[dict], tool_names: list[str],
selected_any: bool)
4. Create evals/utils.py with one function build_messages(data: dict) that
returns [{"role": "system", "content": SYSTEM_PROMPT or override},
{"role": "user", "content": data["prompt"]}].
Import SYSTEM_PROMPT from src.agent.system.prompt.
5. Create evals/executors.py with single_turn_executor(data: dict,
available_tools: list[dict]) -> SingleTurnResult. It must:
- Build messages from data
- Filter available_tools to only those whose name is in data["tools"]
- Call client.chat.completions.create with model "gpt-5-mini",
messages, and tools (or None if empty)
- Parse message.tool_calls into tool_calls (list of {tool_name, args})
and tool_names (list of strings)
- Return SingleTurnResult
Use a module-level OpenAI() client.
IMPORTANT: do NOT execute the tools. We only want which tool was selected.
6. Create evals/evaluators.py with three functions:
- tools_selected(output, target) -> float
Returns 1.0 if every tool in target.expected_tools appears in
output.tool_names, else 0.0. If expected_tools is None/empty, return 1.0.
- tools_avoided(output, target) -> float
Returns 1.0 if NONE of target.forbidden_tools appears in
output.tool_names, else 0.0. If forbidden_tools is None/empty, return 1.0.
- tool_selection_score(output, target) -> float
F1 score (the harmonic mean of precision and recall) of selected vs expected.
Used for "secondary" category.
7. Create evals/data/file_tools.json with at least 5 test cases covering:
- 2 golden cases (read_file and list_files specifically)
- 1 secondary case (ambiguous prompt that should still pick a file tool)
- 2 negative cases ("what is the capital of France?", "tell me a joke")
Each case has shape:
{ "data": { "prompt": "...", "tools": ["read_file","write_file","list_files","delete_file"] },
"target": { "category": "golden", "expected_tools": ["read_file"] } }
For negative cases use forbidden_tools instead of expected_tools.
Note: include "write_file" and "delete_file" in the available tools list
even though we haven't built them yet — the evaluator only filters the
ones that actually exist in FILE_TOOLS, so it's fine.
8. Create evals/file_tools_eval.py that:
- Loads .env
- Loads evals/data/file_tools.json
- For each entry: builds an EvalTarget, calls single_turn_executor with
FILE_TOOLS, and runs the right evaluator based on target.category
(tools_selected for golden, tools_avoided for negative,
tool_selection_score for secondary)
- Prints a checkmark or X, the prompt, the selected tools, and the score
- Prints an overall average at the end
9. Tell me the exact command to run the eval.
Do NOT integrate Laminar yet. Do NOT add multi-turn or LLM-as-judge. Pure
single-turn tool selection with local pass/fail.
What You Should See
Your project should now have an evals/ directory next to src/:
agents-v2/
├── src/
│ └── ...
└── evals/
├── __init__.py
├── types.py
├── utils.py
├── executors.py
├── evaluators.py
├── file_tools_eval.py
└── data/
└── file_tools.json
file_tools.json should have at least 5 entries. Open it and read it — it’s just a list of {data, target} objects in plain English. This is the most important artifact in the chapter. It’s the document you’d hand to a non-engineer stakeholder when they ask “how do you know your agent works?”
How to Verify
Activate your venv and run:
python -m evals.file_tools_eval
You should see output like:
File Tools Evaluation
========================================
✓ [golden] Read the contents of README.md
Selected: ['read_file']
Scores: {'tools_selected': 1.0}
✓ [golden] What files are in the src directory?
Selected: ['list_files']
Scores: {'tools_selected': 1.0}
✓ [secondary] Show me what's in the project
Selected: ['list_files']
Scores: {'tool_selection_score': 1.0}
✓ [negative] What is the capital of France?
Selected: []
Scores: {'tools_avoided': 1.0}
✓ [negative] Tell me a joke
Selected: []
Scores: {'tools_avoided': 1.0}
Average score: 1.00
Two things to check:
- Each test case has a checkmark.
- The average score is at or near 1.00.
If you see one or two failures, that’s actually good — it means your eval is detecting real LLM unreliability. Read the failure carefully. Often you’ll find the prompt was genuinely ambiguous, or the tool description needs tightening. Both are valid fixes.
Now do something interesting: change the description of read_file in src/agent/tools/file.py to something vague like "A tool for files." and re-run the eval. Watch what happens. (Then change it back.)
If It Didn’t Work
ModuleNotFoundError: No module named 'evals'
You’re running it from the wrong place, or evals/__init__.py is missing. Make sure you’re in the project root (agents-v2/) and that evals/__init__.py exists. Run ls evals/__init__.py to verify.
ImportError: cannot import name 'FILE_TOOLS' from 'src.agent.tools'
Step 1 of the prompt didn’t get applied. Tell your coding agent: “You forgot to add FILE_TOOLS to src/agent/tools/__init__.py. Please add it.”
All my golden cases fail with Selected: []
The LLM decided not to call any tool. Usually this means your tool descriptions are too vague. Tell the agent: “My golden eval cases are failing because the LLM isn’t calling any tool. Sharpen the descriptions in src/agent/tools/file.py to be more specific about when each tool should be used.”
One specific golden case fails consistently
The prompt in your test case is genuinely ambiguous, OR the tool descriptions are confusable. Re-read the prompt and ask yourself: if I were the LLM, would I be sure? Either rewrite the prompt to be more specific, or improve the descriptions.
The negative cases fail — the LLM calls a tool when asked about France
This happens with weaker models. It means the tool descriptions are over-broad. They sound like “use me for any question,” and the LLM takes that literally. Tighten the descriptions.
Different runs give different pass rates
Yes — LLMs are non-deterministic. This is why a single run isn’t proof of anything. Mature eval setups run each case multiple times and report a rate (e.g., “passes 9/10 runs”). For this chapter, one clean run is enough; you’ll see the rate-based version when you read AI Engineering (recommended in Chapter 10).
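If you’re curious what a rate-based setup looks like, here is a small sketch. run_case is a hypothetical function that performs one eval attempt and returns True or False; none of this is the chapter’s actual code, just the shape of the idea.

```python
# Run each eval case several times and report a pass rate instead of a
# single pass/fail. run_case is a stand-in for one full eval attempt.
def pass_rate(case, run_case, attempts=10):
    wins = sum(run_case(case) for _ in range(attempts))
    return wins / attempts

# Illustration with a stub that "passes" 9 out of 10 attempts:
results = iter([True] * 9 + [False])
rate = pass_rate({"prompt": "Read README.md"}, lambda c: next(results))
print(f"passes {rate:.0%} of runs")
```

With a real run_case, each attempt costs one API call, which is why teams reserve repeated runs for the cases that matter most.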
Reference Code
evals/data/file_tools.json (click to expand)
[
{
"data": {
"prompt": "Read the contents of README.md",
"tools": ["read_file", "write_file", "list_files", "delete_file"]
},
"target": {
"expected_tools": ["read_file"],
"category": "golden"
}
},
{
"data": {
"prompt": "What files are in the src directory?",
"tools": ["read_file", "write_file", "list_files", "delete_file"]
},
"target": {
"expected_tools": ["list_files"],
"category": "golden"
}
},
{
"data": {
"prompt": "Show me what's in the project",
"tools": ["read_file", "write_file", "list_files", "delete_file"]
},
"target": {
"expected_tools": ["list_files"],
"category": "secondary"
}
},
{
"data": {
"prompt": "What is the capital of France?",
"tools": ["read_file", "write_file", "list_files", "delete_file"]
},
"target": {
"forbidden_tools": ["read_file", "write_file", "list_files", "delete_file"],
"category": "negative"
}
},
{
"data": {
"prompt": "Tell me a joke",
"tools": ["read_file", "write_file", "list_files", "delete_file"]
},
"target": {
"forbidden_tools": ["read_file", "write_file", "list_files", "delete_file"],
"category": "negative"
}
}
]
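The eval modules themselves are short. As a taste, here is a hedged sketch of the three scoring functions from evals/evaluators.py, simplified to take plain lists so it stands alone; the real versions read output.tool_names and target.expected_tools / target.forbidden_tools from the dataclasses.

```python
# Sketch of the three scoring functions, simplified to plain lists.
def tools_selected(selected, expected):
    # Golden: every expected tool must appear among the selected ones.
    if not expected:
        return 1.0
    return 1.0 if all(t in selected for t in expected) else 0.0

def tools_avoided(selected, forbidden):
    # Negative: none of the forbidden tools may appear.
    if not forbidden:
        return 1.0
    return 1.0 if not any(t in selected for t in forbidden) else 0.0

def tool_selection_score(selected, expected):
    # Secondary: F1, the harmonic mean of precision and recall.
    if not selected or not expected:
        return 0.0
    overlap = len(set(selected) & set(expected))
    precision = overlap / len(selected)
    recall = overlap / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Notice that tool_selection_score gives partial credit when the LLM picks a reasonable-but-not-exact set of tools, which is exactly the tolerance the secondary category calls for.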
The full canonical version of all the eval modules is in the Python edition Chapter 3.
What You Just Learned About Agents
This is the chapter with the highest leverage for product roles. Three takeaways.
1. Evals are your contract with reality. Every team building AI features should have an answer to: “How do you know it works?” If the answer is “we tried it and it seemed good,” the team has no eval discipline. If the answer is “we have N test cases across golden/secondary/negative categories and we’re at 94% pass rate, with the 6% failures triaged by category,” the team is shipping responsibly. As a PM, the most valuable single question you can ask of your AI engineering team is: “can I see the eval set?” If they don’t have one, that’s the product risk to escalate. Not “can it do X?” but “how do you measure whether it does X reliably?”
2. There are three flavors of correctness. The golden / secondary / negative split is not just an implementation detail — it’s how you should think about every AI feature.
- Golden: the things it absolutely must get right. (“When the user says ‘cancel my subscription,’ it must trigger the cancellation flow.”)
- Secondary: the things where reasonable behavior is acceptable. (“When the user says ‘I want to leave,’ it should probably offer cancellation, but offering retention is fine too.”)
- Negative: the things it must never do. (“When the user says ‘tell me a joke,’ it must not start a refund.”)
- The eval set should explicitly include all three. Most teams only test the goldens. The negatives are where the lawsuits come from.
3. Test data is more important than test code. Look at the eval framework you just built. It’s about 100 lines of code. The actual value lives in file_tools.json — the list of prompts and expected behaviors. When you upgrade your model from GPT-5-mini to GPT-6-mini, that JSON file is what tells you whether the upgrade is safe. The code is a runner; the data is the asset. Treat it as a first-class artifact: version it, review changes to it, and grow it whenever you discover a new failure mode in production. “Every bug report becomes an eval case” is a habit worth pushing your engineering team to adopt.
You now have the three pillars of every agent: a model call, tool definitions, and a way to verify behavior. The next six chapters add capability — the loop, more tools, web search, context management, and the human-in-the-loop UI. But the architecture you’ve already built is the load-bearing wall. Everything from here is decoration.
Congratulations — you’ve built and tested the foundation of an AI agent without writing a line of code yourself. The remaining chapters of this book follow the same prompt-driven format. When you’re ready, continue to the Python edition for the canonical walkthrough of Chapters 4–10 — or, when this vibe-coding edition expands, come back here.