Skip to main content

Logging and observability

ShopBot now does a lot behind the scenes: it loops, calls tools, validates arguments, asks for approvals, hits a real API, recovers from errors. When it works, great. But when it misbehaves in production — answers oddly, calls the wrong tool, refunds too much — you need to answer one question: what did it actually do, and why? That's observability, and the first step is logging.

Up to now the harness used print() to show what it was doing. That's perfect for a tutorial, but real systems need more:

  • levels — separate routine info from genuine problems,
  • timestamps — know when each step happened,
  • a switch — turn detail up when debugging, down in normal operation,
  • routing — send logs to a file or a monitoring service, not just the screen.

Python's built-in logging module gives all of that. So we replace the harness's internal print()s with proper logs.

Two audiences: the customer and the operator

There's a clean line to draw first. Two very different readers care about a run:

  • The customer sees one thing: the final answer. That stays a print (it's the actual product output).
  • You, the operator, need the play-by-play: every round, tool call, result, and error. That becomes logging.

Keeping them separate means you can crank up logging detail without spamming the customer, and vice versa.

Logs are for you, not the user

A good rule: if a line is meant for the person running the agent, it's a log. If it's meant for the person talking to the agent, it's output. Mixing them is a common early mistake.

Setting up the logger

A few lines, once, near the top of the harness:

import logging

logging.basicConfig(
level=os.environ.get("LOG_LEVEL", "INFO"),
format="%(asctime)s %(levelname)-7s %(name)s: %(message)s",
datefmt="%H:%M:%S",
)
log = logging.getLogger("shopbot")

# Keep the HTTP libraries quiet so our own logs stay readable, even at DEBUG.
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("httpcore").setLevel(logging.WARNING)

Two things worth noticing:

  • LOG_LEVEL comes from the environment (so from your .env). You can switch between INFO and DEBUG without touching code.
  • We quiet httpx/httpcore — the HTTP libraries are chatty at DEBUG, and their noise would bury our own messages. You almost always want third-party loggers turned down.

What to log, and at which level

The level is the message — it says how much you care. ShopBot logs:

EventLevelWhy
Round startedINFOthe normal heartbeat of the loop
Tool call + argumentsINFOthe single most useful line for debugging
Tool resultDEBUGuseful, but verbose — only when digging in
Model's stop_reasonDEBUGdetail for understanding a decision
Sensitive action declinedWARNINGsomeone said no — worth noticing
Tool failedERRORsomething actually went wrong
Final answer producedINFOthe loop finished cleanly

In code it's just the right call at the right spot — for example, in run_tool:

log.info("tool call: %s args=%s", name, arguments)
...
log.warning("tool %s declined by human", name)
...
log.debug("tool %s result: %s", name, result)

Turning the dial

At the default INFO, a normal run reads as a clean timeline:

14:33 INFO shopbot: round 1: asking the model
14:33 INFO shopbot: tool call: look_up_orders args={'customer': 'alex@example.com'}
14:33 INFO shopbot: round 2: asking the model
14:33 INFO shopbot: tool call: get_tracking_status args={'tracking_number': '1Z999'}
14:33 INFO shopbot: round 3: asking the model
14:33 INFO shopbot: final answer after 3 round(s)

Set LOG_LEVEL=DEBUG and the same run also shows the tool results and the model's stop reasons — the extra detail you want only when something's wrong:

14:33 INFO shopbot: tool call: look_up_orders args={'customer': 'alex@example.com'}
14:33 DEBUG shopbot: tool look_up_orders result: [{'order_number': '4521', ...}]
14:33 DEBUG shopbot: model stop_reason=tool_use

One environment variable, two very different views — and no code change.

Why this matters in production

  • Debugging — when a run goes wrong, the log is your replay of every decision.
  • Auditing — for sensitive actions like refunds, you want a durable record of what happened and that a human approved it (ties straight back to permissions).
  • Monitoring — counting ERROR lines or slow rounds tells you the agent's health before customers complain.
Never log secrets or personal data

Logs get stored, shipped to other systems, and read by lots of people. Don't log API keys, full customer records, or payment details. Log enough to debug — an order number, a tool name — not everything. This is a real source of breaches.

Beyond plain text, production systems often emit structured logs (JSON) into a log aggregator so they can be searched and alerted on — but the discipline is the same one you just applied: log the right events at the right level.

Recap

  • print is fine for a demo; production needs logging — levels, timestamps, and a toggle.
  • Separate the two audiences: the customer gets the final answer (print), the operator gets the play-by-play (log).
  • Use levels to encode importance — routine steps at INFO, detail at DEBUG, problems at WARNING/ERROR — and drive verbosity from LOG_LEVEL.
  • Logging powers debugging, auditing, and monitoring — but never log secrets or personal data.

That rounds out a production-minded harness: it loops, uses validated and permissioned tools against a real backend, recovers from failures, and tells you what it did. Next we connect it all to live data.