Logging and observability
ShopBot now does a lot behind the scenes: it loops, calls tools, validates arguments, asks for approvals, hits a real API, recovers from errors. When it works, great. But when it misbehaves in production — answers oddly, calls the wrong tool, refunds too much — you need to answer one question: what did it actually do, and why? That's observability, and the first step is logging.
print was fine for learning — not for production
Up to now the harness used print() to show what it was doing. That's perfect
for a tutorial, but real systems need more:
- levels — separate routine info from genuine problems,
- timestamps — know when each step happened,
- a switch — turn detail up when debugging, down in normal operation,
- routing — send logs to a file or a monitoring service, not just the screen.
Python's built-in logging module gives all of that. So we replace the harness's
internal print()s with proper logs.
Two audiences: the customer and the operator
There's a clean line to draw first. Two very different readers care about a run:
- The customer sees one thing: the final answer. That stays a
print(it's the actual product output). - You, the operator, need the play-by-play: every round, tool call, result, and error. That becomes logging.
Keeping them separate means you can crank up logging detail without spamming the customer, and vice versa.
A good rule: if a line is meant for the person running the agent, it's a log. If it's meant for the person talking to the agent, it's output. Mixing them is a common early mistake.
Setting up the logger
A few lines, once, near the top of the harness:
import logging
logging.basicConfig(
level=os.environ.get("LOG_LEVEL", "INFO"),
format="%(asctime)s %(levelname)-7s %(name)s: %(message)s",
datefmt="%H:%M:%S",
)
log = logging.getLogger("shopbot")
# Keep the HTTP libraries quiet so our own logs stay readable, even at DEBUG.
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("httpcore").setLevel(logging.WARNING)
Two things worth noticing:
LOG_LEVELcomes from the environment (so from your.env). You can switch betweenINFOandDEBUGwithout touching code.- We quiet
httpx/httpcore— the HTTP libraries are chatty atDEBUG, and their noise would bury our own messages. You almost always want third-party loggers turned down.
What to log, and at which level
The level is the message — it says how much you care. ShopBot logs:
| Event | Level | Why |
|---|---|---|
| Round started | INFO | the normal heartbeat of the loop |
| Tool call + arguments | INFO | the single most useful line for debugging |
| Tool result | DEBUG | useful, but verbose — only when digging in |
Model's stop_reason | DEBUG | detail for understanding a decision |
| Sensitive action declined | WARNING | someone said no — worth noticing |
| Tool failed | ERROR | something actually went wrong |
| Final answer produced | INFO | the loop finished cleanly |
In code it's just the right call at the right spot — for example, in run_tool:
log.info("tool call: %s args=%s", name, arguments)
...
log.warning("tool %s declined by human", name)
...
log.debug("tool %s result: %s", name, result)
Turning the dial
At the default INFO, a normal run reads as a clean timeline:
14:33 INFO shopbot: round 1: asking the model
14:33 INFO shopbot: tool call: look_up_orders args={'customer': 'alex@example.com'}
14:33 INFO shopbot: round 2: asking the model
14:33 INFO shopbot: tool call: get_tracking_status args={'tracking_number': '1Z999'}
14:33 INFO shopbot: round 3: asking the model
14:33 INFO shopbot: final answer after 3 round(s)
Set LOG_LEVEL=DEBUG and the same run also shows the tool results and the
model's stop reasons — the extra detail you want only when something's wrong:
14:33 INFO shopbot: tool call: look_up_orders args={'customer': 'alex@example.com'}
14:33 DEBUG shopbot: tool look_up_orders result: [{'order_number': '4521', ...}]
14:33 DEBUG shopbot: model stop_reason=tool_use
One environment variable, two very different views — and no code change.
Why this matters in production
- Debugging — when a run goes wrong, the log is your replay of every decision.
- Auditing — for sensitive actions like refunds, you want a durable record of what happened and that a human approved it (ties straight back to permissions).
- Monitoring — counting
ERRORlines or slow rounds tells you the agent's health before customers complain.
Logs get stored, shipped to other systems, and read by lots of people. Don't log API keys, full customer records, or payment details. Log enough to debug — an order number, a tool name — not everything. This is a real source of breaches.
Beyond plain text, production systems often emit structured logs (JSON) into a log aggregator so they can be searched and alerted on — but the discipline is the same one you just applied: log the right events at the right level.
Recap
printis fine for a demo; production needs logging — levels, timestamps, and a toggle.- Separate the two audiences: the customer gets the final answer (
print), the operator gets the play-by-play (log). - Use levels to encode importance — routine steps at
INFO, detail atDEBUG, problems atWARNING/ERROR— and drive verbosity fromLOG_LEVEL. - Logging powers debugging, auditing, and monitoring — but never log secrets or personal data.
That rounds out a production-minded harness: it loops, uses validated and permissioned tools against a real backend, recovers from failures, and tells you what it did. Next we connect it all to live data.