
Reliable Developer Tooling for Agentic Software Development

December 2025 Topics: Developer Tools, AI Agents, Local Development, DevOps

Abstract

The best AI-agent-based development workflows are those where the agent can not only write code, but verify the build, start the service, test its changes, and debug failures — all reliably and without human intervention. This paper describes developer tooling designed to make that possible across multi-service local development environments.

The difference between exploratory "vibe coding" and production-grade agentic development is reliability. When an agent encounters a port conflict, a stale binary, a failed migration, or a verbose build error, the quality of its response depends entirely on the quality of its tooling. Unreliable tooling leads to wasted context, incorrect assumptions, and cascading failures. Reliable tooling lets the agent focus on the actual problem.

This tooling provides deterministic service lifecycle management (build, run, stop, restart), content-based change detection that survives git operations, multi-database orchestration, and environment isolation via git worktrees. It is designed to work consistently across AI agents from different providers — Claude, Codex, Gemini — as well as human developers, through a dual CLI/TUI interface.

Starting point: For teams beginning to add reliable development patterns, a Makefile is the right first step — it is ubiquitous, well-understood, and immediately useful. As the number of services grows and requirements like change detection, database orchestration, and environment isolation emerge, the limitations of Make drive the need for purpose-built tooling. This paper describes what that tooling looks like.

1. The Problem

AI coding agents have become capable enough to implement features, fix bugs, and refactor code across large codebases. But writing code is only part of the development lifecycle. The agent also needs to build, run, test, and debug — and each of these operations is a source of failure that can derail an entire session.

What Goes Wrong

Consider what happens when an AI agent needs to start a service after making a code change:

  • Port conflicts - A previous instance is still bound to the port. The agent tries to start the service, gets "address already in use," and now spends multiple turns finding the process ID, killing it, and retrying — or worse, picks a different port, introducing environment drift
  • Orphaned processes - The agent killed the service but a child process still holds the port. Or the PID file is stale because the process crashed without cleanup
  • Build directory confusion - The build command runs from the wrong directory, or places the binary in an unexpected location. The agent runs a stale binary and draws wrong conclusions about whether its code change worked
  • Context pollution - The build succeeds but produces 200 lines of compiler output, warnings, and dependency resolution logs. This consumes valuable context window that the agent needs for reasoning about the actual task
  • Debug spirals - A service fails at runtime. The agent reads the full log file — thousands of lines — trying to find the error. Most of it is irrelevant startup output. The relevant error is buried, and the agent's context is now saturated with noise
  • Stale binaries across services - In a multi-service environment, the agent rebuilds one service but forgets another that also needs rebuilding. Requests fail at service boundaries, and the agent debugs the wrong service
  • Database state drift - Schema migrations haven't been applied, or seed data is missing. The agent ends up debugging "why does this query return no results" when the real problem is that the table doesn't have the new column yet

Each of these problems is individually solvable. An experienced developer handles them intuitively. But an AI agent treats each one as a novel problem, consuming context and making exploratory decisions that compound into unreliable outcomes.

The Reliability Gap

The difference between exploratory coding and production-grade agentic development is that every SDLC operation — build, start, stop, test, debug, database reset — produces a predictable result. The tool either succeeds silently or fails with exactly the information needed to take corrective action. No ambiguity, no extraneous output, no environmental surprises.

2. From Makefiles to Purpose-Built Tooling

For teams adding reliable development patterns to a project, Makefiles are the right starting point. Make is ubiquitous, requires no installation, and immediately solves the most basic problem: "how do I build and run this?" A well-written Makefile with make build, make run, and make db-reload targets provides enormous value for both human developers and AI agents.

However, as a project scales to multiple services with shared dependencies, the limitations of Make become apparent:

  • No content-aware change detection - Make uses file modification times (mtime) to determine what needs rebuilding. Any git operation (checkout, rebase, cherry-pick) resets all timestamps, triggering unnecessary full rebuilds. Content-based hashing solves this but cannot be expressed in Make syntax
  • No service lifecycle management - Make can start a process but cannot reliably stop a previous instance, detect port conflicts, capture logs to a file, or verify successful startup. These require process management logic that Make was not designed for
  • No database orchestration - Managing schema migrations, seed data, and hash-based reload detection across multiple database types (PostgreSQL, DynamoDB, object storage) requires state tracking that Make targets cannot maintain
  • No environment isolation - Parallel development across git worktrees requires port offset calculation, database name isolation, and configuration coordination that crosses Make's complexity threshold

This tooling was built in Go as a purpose-built solution when these limitations became blocking. It is not a general-purpose build system — it is a fit-for-purpose tool designed for the specific requirements of multi-service local development with AI agent support.

3. Architecture Overview

+---------------------------------------------------------------+
|                      Developer Tools                           |
+---------------------------------------------------------------+
|                                                                |
|  +---------------+  +----------------+  +------------------+   |
|  | CLI Interface |  | TUI Interface  |  | Config System    |   |
|  | (AI Agents)   |  | (Developers)   |  | (3-level)        |   |
|  +---------------+  +----------------+  +------------------+   |
|                                                                |
|  +-----------------------------------------------------------+ |
|  |                    Core Modules                           | |
|  |  +-----------+  +-----------+  +----------+  +--------+   | |
|  |  | Service   |  | Database  |  | Build    |  |Worktree|   | |
|  |  | Lifecycle |  | Manager   |  | System   |  |Manager |   | |
|  |  +-----------+  +-----------+  +----------+  +--------+   | |
|  +-----------------------------------------------------------+ |
|                                                                |
|  +-----------------------------------------------------------+ |
|  |                  Infrastructure                           | |
|  |   Process Mgmt | Docker Containers | Git Operations       | |
|  +-----------------------------------------------------------+ |
+---------------------------------------------------------------+

Dual Interface Design

The same underlying operations are exposed through two interfaces designed for different users:

CLI (for AI agents): Minimal output on success (typically zero lines). Structured error output on failure. Designed for deterministic behavior — given the same state, the same command produces the same result. While external factors (network availability, resource contention) can introduce variability, the tooling controls what it can to minimize non-determinism and make outcomes as predictable as possible.

TUI (for developers): Live dashboard showing all service states, build status, database status, and git information. Hot-key shortcuts for common operations (build, run, restart, view logs). Real-time error monitoring across all services.

Platform support: The tooling currently supports macOS and Linux. Port detection uses lsof, process management uses POSIX signals, and shell operations assume a Unix-like environment. Windows is not supported at this time.
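
As an illustration, a port-ownership check of the kind described above can be built directly on lsof. This is a minimal sketch, not the tool's actual implementation; the specific lsof flags, package name, and error handling are assumptions.

package lifecycle

import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
)

// portOwner returns the PID of the process listening on the given TCP port,
// or 0 if the port is free.
func portOwner(port int) (int, error) {
    // -t prints bare PIDs, -i filters by address, -sTCP:LISTEN keeps only listeners.
    // lsof exits non-zero when nothing matches, which we treat as "port free".
    out, err := exec.Command("lsof", "-t", "-i", fmt.Sprintf("tcp:%d", port), "-sTCP:LISTEN").Output()
    if err != nil {
        return 0, nil
    }
    fields := strings.Fields(strings.TrimSpace(string(out)))
    if len(fields) == 0 {
        return 0, nil
    }
    return strconv.Atoi(fields[0])
}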

Environment Status

A single status command provides a complete view of the development environment — every service's run state, build freshness, database sync status, and git position:

$ dt status

Services:
  Service              URL                    PID     Binary       BUILD  DB   Git
  -------              ---                    ---     ------       -----  --   ---
  service-api          http://localhost:8090  42015   80M (2h)     OK     OK   main OK
  service-jobs         http://localhost:8091  42029   47M (2h)     OK     OK   main OK
  service-gateway      http://localhost:8092  42023   43M (2h)     OK     OK   main OK
  service-worker       -                      -       35M (2h)     OK     OK   main OK
  web-app              http://localhost:8081  42047   31M (2h)     OK     OK   main OK
  web-admin            http://localhost:8022  42044   46M (2h)     OK     OK   main OK
  web-home             http://localhost:8080  42051   30M (2h)     WARN   OK   main WARN

Databases:
  Database         Status      Port
  --------         ------      ----
  postgres         running     :5432
  dynamodb         running     :8000
  minio            running     :9000
  clickhouse       running     :9002

For an AI agent, this output answers several questions at once: which services need rebuilding (WARN), which are running, which databases are available, and whether any services have uncommitted changes. For a human developer joining the project, it shows the full state of the environment in seconds.

4. Service Lifecycle Management

The most failure-prone part of agentic development is service lifecycle management — the sequence of stopping an old instance, building a new one, starting it, and verifying it runs correctly. Each step has edge cases that, if handled incorrectly, lead to cascading confusion.

The Start Sequence

Start Service Request
    |
    v
[Check for existing instance]
    |-- Read PID from build/.pid
    |-- Is process still alive? (signal check)
    |-- If yes: send SIGTERM, wait up to 2s, then SIGKILL
    |
    v
[Check for orphaned port binding]
    |-- Query port via lsof
    |-- If occupied by different process: kill it
    |-- Wait for port release
    |
    v
[Build if needed]
    |-- Hash source files (SHA-256)
    |-- Compare against cached hashes
    |-- If changed: build, capture output to build.log
    |-- If unchanged: skip (report "build OK")
    |
    v
[Start new process]
    |-- Set environment variables (from .env.local + overrides)
    |-- Redirect stdout/stderr to service log file
    |-- Launch binary, record PID to build/.pid
    |-- Detach (background goroutine monitors for exit)
    |
    v
[Report result]
    |-- Success: service name, URL, PID (one line)
    |-- Failure: error message from build.log or startup

Key design decisions in this sequence:

  • PID-first, port-second cleanup - The tool first tries to stop the known process via its PID file. Only if that fails (stale PID, crashed process) does it fall back to port-based detection. This ordering minimizes the risk of killing unrelated processes. On shared machines, the port-based fallback could terminate an unrelated process that happens to hold the port — in such environments, adding ownership checks or requiring an explicit force flag would be prudent. In dedicated development environments, the aggressive cleanup is generally safe and preferred for reliability
  • Graceful then forceful shutdown - SIGTERM gives the service a configurable grace period (default: 2 seconds) to flush database connections and close files. If it doesn't respond, SIGKILL terminates it immediately. The default is tuned for local development where services are lightweight; production-oriented environments may increase this to accommodate heavier shutdown sequences
  • Log capture to file - All service output goes to a log file, not stdout. This is critical for AI agents: the agent sees only the start/stop result, not thousands of lines of runtime logs. When debugging is needed, the agent can request filtered log output separately
  • Environment from .env files - Service environment variables are loaded from .env.local (or .env.{environment}) files with optional per-service overrides. These files contain secrets and credentials and are excluded from source control via .gitignore. AI agents should respect this boundary and avoid including .env file contents in prompts or outputs sent to external services
  • Detached execution - The tool returns immediately after starting the service. A background monitor detects if the process exits unexpectedly and cleans up PID/port files

The Shutdown Sequence

Stopping a service follows a deterministic sequence: send SIGTERM, poll every 100ms for up to 2 seconds to check if the process exited, then send SIGKILL if it hasn't. After the process is gone, PID and port tracking files are cleaned up to prevent stale state on the next start.
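
A minimal sketch of that sequence in Go — the grace period and poll interval match the defaults described above, while the function name and surrounding cleanup are illustrative:

package lifecycle

import (
    "os"
    "syscall"
    "time"
)

// stopProcess sends SIGTERM, polls for exit every 100ms until the grace period
// expires, then falls back to SIGKILL. The real tool also removes PID and port
// tracking files afterwards.
func stopProcess(pid int, grace time.Duration) error {
    proc, err := os.FindProcess(pid) // on Unix this always succeeds
    if err != nil {
        return err
    }
    if err := proc.Signal(syscall.SIGTERM); err != nil {
        return nil // process already gone
    }
    deadline := time.Now().Add(grace)
    for time.Now().Before(deadline) {
        // Signal 0 is a pure liveness check: it delivers nothing but reports
        // whether the process still exists.
        if err := proc.Signal(syscall.Signal(0)); err != nil {
            return nil // exited within the grace period
        }
        time.Sleep(100 * time.Millisecond)
    }
    return proc.Signal(syscall.SIGKILL) // grace period expired; force kill
}

Calling stopProcess(pid, 2*time.Second) reproduces the default behavior; a longer grace period suits services with heavier shutdown work.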

5. Change Detection and Build Intelligence

Unnecessary rebuilds are a compounding cost in agentic development. Each rebuild takes time, produces output that may pollute context, and restarts services that may not need restarting. The goal is to rebuild only when source code has actually changed — not when git operations have reset file timestamps.

Content-Based Hashing

The build system computes SHA-256 checksums of all tracked source files and compares them against a cached state. Tracked files include application source code, templates, configuration files (go.mod, go.sum), and WASM build outputs. Generated files, vendored dependencies, and build artifacts are excluded.

The cache is stored per-service as a JSON file containing the hash of each individual file plus a combined hash for quick comparison. When any file's hash differs from the cached value, the tool reports exactly what changed (e.g., "go files changed", "template files changed", "wasm module changed") — giving the agent or developer a clear reason for the rebuild.
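
A sketch of that check in Go, with an illustrative JSON schema for the cache (the tool's actual field names and on-disk layout are not shown here):

package buildcache

import (
    "crypto/sha256"
    "encoding/hex"
    "io"
    "os"
    "sort"
)

// Cache mirrors the per-service state described above: a hash per tracked file
// plus a combined hash for a quick "has anything changed?" comparison.
type Cache struct {
    Files    map[string]string `json:"files"`
    Combined string            `json:"combined"`
}

func hashFile(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

// Snapshot hashes every tracked file and derives a combined hash, so callers
// can compare against a previously saved Cache with a single string comparison.
func Snapshot(paths []string) (Cache, error) {
    c := Cache{Files: make(map[string]string)}
    sort.Strings(paths) // deterministic combined hash regardless of walk order
    combined := sha256.New()
    for _, p := range paths {
        sum, err := hashFile(p)
        if err != nil {
            return Cache{}, err
        }
        c.Files[p] = sum
        combined.Write([]byte(p + ":" + sum + "\n"))
    }
    c.Combined = hex.EncodeToString(combined.Sum(nil))
    return c, nil
}

Comparing the combined hash answers the fast-path question; diffing the per-file map identifies which files changed and therefore which rebuild reason to report.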

Build systems like Bazel and Buck have used content-based hashing for years. The approach here is simpler and more narrowly scoped: rather than a general-purpose build graph, it provides a fast "has anything changed?" check that integrates with the service lifecycle. The check runs in milliseconds for typical projects and, critically, produces correct results after git operations that reset all file timestamps — a scenario where Make's mtime-based approach triggers unnecessary full rebuilds.

Database Change Detection

The same hashing approach extends to database schemas and seed data. Migration files and seed SQL are hashed independently, so the tool can report whether a service needs a schema migration, a seed data reload, both, or neither. This prevents the common agentic failure mode of debugging query errors that are actually caused by an unapplied migration.

Parallel Builds

When multiple services need rebuilding (e.g., after pulling changes), builds execute in parallel with a configurable concurrency limit. Each build captures its output to a separate log file. If any build fails, the error is reported with the relevant compiler output only — not the output of all concurrent builds mixed together.
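
A sketch of the fan-out, assuming each service builds with a conventional go build invocation; the command, directory layout, and source of the concurrency limit are assumptions for illustration:

package buildrunner

import (
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "sync"
)

// buildAll builds each service concurrently, capping parallelism with a
// counting semaphore and capturing each build's output in its own log file.
// It returns a map of failed service directories to their errors.
func buildAll(serviceDirs []string, maxParallel int) map[string]error {
    sem := make(chan struct{}, maxParallel)
    var wg sync.WaitGroup
    var mu sync.Mutex
    failures := make(map[string]error)

    for _, dir := range serviceDirs {
        wg.Add(1)
        go func(dir string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a build slot
            defer func() { <-sem }() // release it when done

            logPath := filepath.Join(dir, "build", "build.log")
            logFile, err := os.Create(logPath)
            if err == nil {
                // "./cmd/server" is an assumed layout for the service's main package.
                cmd := exec.Command("go", "build", "-o", filepath.Join("build", "app"), "./cmd/server")
                cmd.Dir = dir
                cmd.Stdout = logFile // compiler output stays out of the caller's stdout
                cmd.Stderr = logFile
                err = cmd.Run()
                logFile.Close()
            }
            if err != nil {
                mu.Lock()
                failures[dir] = fmt.Errorf("build failed, see %s: %w", logPath, err)
                mu.Unlock()
            }
        }(dir)
    }
    wg.Wait()
    return failures
}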

6. Environment Isolation

Parallel feature development requires isolated environments where changes to one feature don't interfere with another. Git worktrees provide code isolation, but services also need port and database isolation.

Port Isolation

Each worktree is assigned an index, and service ports are offset by worktreeIndex * 100. If the main worktree runs services on ports 8080-8099, worktree 1 uses 8180-8199, worktree 2 uses 8280-8299, and so on. This is deterministic — given a worktree name, the port assignments are always the same.

Services that expose multiple ports (e.g., HTTP, gRPC, debug/pprof) are assigned sequential ports within their chunk. For example, a service with three ports might use 8080, 8081, and 8082 in the main worktree, and 8180, 8181, and 8182 in worktree 1. The 100-port range per worktree provides ample room for multi-port services — the constraint is that the total number of ports used across all services in a worktree must fit within the allocated range.
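
The port arithmetic is simple enough to show directly; the names here are illustrative:

package worktree

// servicePort derives a deterministic port: the service's main-worktree port,
// shifted by 100 per worktree index, plus a small offset for services that
// expose more than one port (HTTP, gRPC, pprof, ...).
func servicePort(mainPort, worktreeIndex, portOffset int) int {
    return mainPort + worktreeIndex*100 + portOffset
}

// Example: a service on 8080/8081/8082 in the main worktree maps to
// 8180/8181/8182 in worktree 1:
//
//   servicePort(8080, 1, 0) == 8180
//   servicePort(8080, 1, 2) == 8182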

Database Isolation

Database isolation uses naming conventions rather than separate containers, which is significantly faster to set up and tear down:

  • PostgreSQL: Each worktree gets a database name suffix (e.g., service_api_feature_auth)
  • DynamoDB: Table name prefix per worktree (e.g., feature_auth_users)
  • Object storage: Bucket name prefix per worktree (e.g., feature-auth-uploads)

All worktrees share the same database containers. Creating a new worktree takes seconds rather than the minutes required to spin up additional containers. When a worktree is deleted, all associated databases and tables are cleaned up automatically. This cleanup is scoped by namespace — only databases and tables matching the worktree's naming prefix are removed. Protected prefixes (e.g., the main worktree's databases) are never eligible for automatic deletion. For additional safety, the cleanup can be gated behind a confirmation prompt with a force-override flag for scripted environments.
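
A sketch of the name derivation, with the separators and sanitisation rules assumed for illustration rather than taken from the tool:

package worktree

import "strings"

// postgresDB composes a service's isolated database name, e.g.
// postgresDB("service_api", "feature-auth") == "service_api_feature_auth".
func postgresDB(service, worktree string) string {
    return service + "_" + strings.ReplaceAll(strings.ToLower(worktree), "-", "_")
}

// dynamoTable prefixes table names per worktree, e.g.
// dynamoTable("feature-auth", "users") == "feature_auth_users".
func dynamoTable(worktree, table string) string {
    return strings.ReplaceAll(strings.ToLower(worktree), "-", "_") + "_" + table
}

// bucketName prefixes object-storage buckets per worktree, e.g.
// bucketName("feature-auth", "uploads") == "feature-auth-uploads".
// Underscores are replaced because S3-style bucket names do not allow them.
func bucketName(worktree, bucket string) string {
    return strings.ReplaceAll(strings.ToLower(worktree), "_", "-") + "-" + bucket
}

Because cleanup is scoped by these same prefixes, a deterministic derivation function is also what makes automatic deletion safe to scope to a single worktree.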

7. Debugging and Context Protection

Debugging is where unreliable tooling causes the most damage to agentic workflows. A single verbose log dump can saturate an agent's context window, leaving it unable to reason effectively about subsequent steps.

Structured Log Access

Service logs are always written to files, never to the agent's stdout. When an agent needs to debug a failure, it requests filtered log output — the tool extracts error lines with a configurable number of context lines before and after each match. This produces a focused view of what went wrong without the thousands of lines of normal operation that surround it.

A background error monitor continuously watches all service logs for patterns indicating failure (errors, panics, fatal messages). When the agent or developer asks for errors, these are presented as structured blocks with timestamps and surrounding context — not as raw log streams.
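
A sketch of the extraction step, assuming a simple error/panic/fatal pattern; the tool's actual pattern set and API are not published, so both are assumptions:

package logs

import (
    "bufio"
    "os"
    "regexp"
)

// errorPattern approximates the failure markers the monitor watches for.
var errorPattern = regexp.MustCompile(`(?i)\b(error|panic|fatal)\b`)

// extractErrors returns each matching line together with `context` lines before
// and after it, so an agent sees a focused view instead of the whole log.
func extractErrors(path string, context int) ([][]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var lines []string
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        return nil, err
    }

    var blocks [][]string
    for i, line := range lines {
        if !errorPattern.MatchString(line) {
            continue
        }
        start, end := i-context, i+context+1
        if start < 0 {
            start = 0
        }
        if end > len(lines) {
            end = len(lines)
        }
        blocks = append(blocks, lines[start:end])
    }
    return blocks, nil
}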

Multi-Provider Consistency

We believe in a multi-provider, multi-agent world. These tools are designed to provide a consistent development experience regardless of which AI agent or model is being used — whether that is Claude, Codex, Gemini, or another. The CLI interface is model-agnostic: any agent that can execute shell commands can use these tools effectively.

The tool's instructions and behaviors can be defined once in a project's developer guide and work across all agents. This eliminates the need to maintain separate automation scripts for different AI tools and ensures that switching between agents doesn't require re-learning how the development environment works.

Sub-agent delegation: Agent systems that support sub-agent delegation (such as Claude Code's skills and sub-agent system) can take this a step further. Build and debug operations can be delegated to specialized sub-agents that execute the tool commands, interpret results, and report back concisely — protecting the main agent's context from build output and log noise. This delegation pattern works with any agent framework that supports spawning child agents, and is not tied to any specific provider.

8. Key Insights and Gotchas

Insights

  1. The agent's ability to verify its own work is the bottleneck - Writing code is the easy part. The hard part is building, running, and confirming the change works as expected. Reliable tooling for this verification loop is what separates production-grade agentic development from exploratory coding
  2. Start with Makefiles, graduate when you must - A Makefile covers 80% of the value for single-service projects. Custom tooling becomes necessary when the number of services, database types, and environment isolation requirements exceed what Make can express maintainably
  3. Content hashing over mtime - File modification times are unreliable across git operations. Content-based hashing adds milliseconds of overhead but eliminates unnecessary rebuilds that waste minutes and confuse agents
  4. Database-level isolation over container-level - Using naming conventions within shared containers is orders of magnitude faster for worktree creation than spinning up separate database containers per environment
  5. Never build or restart what hasn't changed - Unnecessary operations don't just waste time — they compound. Every unnecessary restart produces output, resets state, and introduces opportunities for the agent to draw wrong conclusions about what changed
  6. Protect context aggressively - The most important output property for AI agents is silence on success. The second most important is precision on failure: exactly the error, with just enough context, and nothing else

Gotchas

  • Port offset arithmetic must account for all services - If a project has 15 services and the offset is 100 per worktree, ensure no service's base port falls within another worktree's range
  • Worktree deletion must clean up all database artifacts - Forgetting to drop isolated databases or tables leaves orphaned data that can cause confusion in subsequent worktrees
  • Stale PID files after crashes - If a service crashes without cleanup, the PID file points to a dead process. The tool must check process liveness, not just file existence
  • Hash cache must invalidate when the binary is missing - If someone manually deletes the build output, the hash cache still says "up to date." The tool must check for the binary's existence as a precondition, not just hash state
  • Log file truncation on restart - By default, service logs are truncated when the service restarts. This is intentional: a fresh log file provides a clean starting point for each run, eliminating confusion from stale errors and reducing context size during AI-driven log analysis. The trade-off is that previous-run errors are lost — if an agent needs historical errors, it should capture them before triggering a restart. This behavior could be made configurable (e.g., log rotation keeping N previous files) for environments where historical log retention is preferred

References

  1. GNU Make Manual. Free Software Foundation. https://www.gnu.org/software/make/manual/
  2. Bazel Build System Documentation. Google. https://bazel.build/docs
  3. Git Worktree Documentation. Git SCM. https://git-scm.com/docs/git-worktree
  4. Docker Compose Specification. Docker Inc. https://docs.docker.com/compose/