---
title: "Testing and Linting — Keeping LLMs on the Rails"
date: "2026-05-26"
author: "Stefan Loesch"
summary: "Tests and linting are table stakes. But with LLMs, you need structural linting — architectural rules encoded as automated checks — to stop the codebase from drifting into chaos."
description: "Why LLM-assisted coding needs testing, linting, and especially structural linting to enforce architectural rules and prevent codebase drift."
keywords: "vibe coding, LLM coding, AI software development, testing, linting, structural linting, code quality, Claude coding, software architecture, TDD, integration testing"
tags: AI, Software Engineering, Vibe Coding
url: "https://aigon.ai/blog/2026-05-26-testing-linting-for-vibe-coding-llms/"
---

*This is Part 2 of the "Does Vibe Coding Work?" series. [Part 1: Does Vibe Coding work?](/blog/2026-05-14-does-vibe-coding-work/) discussed why complexity is the fundamental challenge in LLM-assisted programming.*

In [Part 1](/blog/2026-05-14-does-vibe-coding-work/), we discussed how LLMs are powerful programming tools: very fast and very knowledgeable. We also identified the issues that every LLM project eventually runs into, one of them being what we called "Quadratic Complexity": features interact with features, so adding feature $N+1$ involves up to $N$ potential interactions, and the total effort of adding features $1\ldots N$ grows quadratically in the worst case. To compound the problem, LLMs tend to have a strong local focus. By default they don't have a big-picture view, and they tend to go with a _good-enough_ solution rather than the optimal one. As a consequence, LLMs often break features that were not obviously connected, especially as the code gets bigger and more chaotic.
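
To make the quadratic growth concrete, here is a back-of-the-envelope count. It rests on the simplifying worst-case assumption that feature $k$ can interact with each of the $k-1$ features that came before it:

$$
\sum_{k=1}^{N} (k-1) = \frac{N(N-1)}{2} \approx \frac{N^2}{2}
$$

So under this assumption, 10 features give up to 45 potential pairwise interactions, and 100 features give up to 4950.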

We introduced a number of remedies for this, and in this article we will focus on the first category identified -- *testing* and *linting*. Those two actions are complementary: testing aims to ensure that nothing is broken, and linting aims to ensure a coherent structure of the project, thereby making it easier for the LLM to operate in its code base.

## Unit and integration testing

Testing is nothing new of course -- every major codebase has a test suite, and it is essentially impossible to run a multi-contributor project without one. When programming with LLMs, tests become even more important because LLMs can be very negligent about side effects, worse than any junior programmer I've ever encountered. And worst of all -- you cannot really teach them. They'll apologize, they'll tell you they won't do it again, and after the next context compaction all this is forgotten and the same mistakes happen over and over again. **The more guardrails you can place, the better they will perform**. So testing is essential infrastructure in AI-assisted programming.

**Unit testing.** Unit tests verify that individual functions and components work correctly in isolation. You call a function with known inputs, check that the output matches what you expect. They are fast, focused, and catch regressions early — if you change a function and its unit test breaks, you know immediately what went wrong and where.

**Integration testing.** Integration tests go a step further. They verify that components work together — that the data flowing from module A through module B to module C arrives intact and correctly transformed. Where unit tests check that each piece works on its own, integration tests check that the pieces still fit together. This is often where LLMs break things because they change the behaviour in one part of the code and this impacts what happens elsewhere.
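
As a minimal illustration of the difference, here is a sketch with three toy functions. The names `parse_price`, `apply_discount` and `discounted_total` are made up for this example and not taken from any real project:

```python
def parse_price(text: str) -> float:
    """Parse a price string such as ' 19.99 ' into a float."""
    return float(text.strip())


def apply_discount(price: float, percent: float) -> float:
    """Reduce a price by the given percentage, rounded to cents."""
    return round(price * (1 - percent / 100), 2)


def discounted_total(raw_prices: list[str], percent: float) -> float:
    """Pipeline: parse every raw price, discount it, and sum the results."""
    total = sum(apply_discount(parse_price(p), percent) for p in raw_prices)
    return round(total, 2)


# Unit tests: one function at a time, known input, expected output.
def test_parse_price():
    assert parse_price(" 19.99 ") == 19.99


def test_apply_discount():
    assert apply_discount(100.0, 25) == 75.0


# Integration test: the pieces still fit together end to end.
def test_discounted_total():
    assert discounted_total(["10.00", " 20.00 "], 10) == 27.0
```

If the LLM later changes `parse_price` to return a string or a `Decimal`, the unit tests for the other functions may still pass, and it is the integration test that notices.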

There are two broad strategies for how you design test suites. The first is test-driven development — you invest heavily upfront in writing your tests, and then you build the code to make them pass. This approach works well with LLMs. Once the tests are written, you can literally hand them to the LLM and say "make these pass." The LLM builds the implementation, runs the tests, sees what fails, and iterates until everything is green. Tools like the `/goal` command in Claude Code are built for exactly this workflow — you set the goal, and the LLM keeps working until the tests pass.

If you're working in a spec-driven way — defining requirements upfront before writing code — this can be very powerful. The [BMAD method](https://docs.bmad-method.org/) is built around exactly this idea: rigorous specs first, then let AI agents execute against them. For teams in big IT with well-defined requirements, this approach can work extremely well.

But for the kind of development many of us are doing — building from the bottom up, experimenting, seeing what works — writing comprehensive tests upfront isn't always realistic. You're discovering the requirements as you go. So you need a second strategy.

This strategy is simpler: you have the LLM write the code first, get it working the way you want, and then tell the LLM to tie it down with tests. The LLM will not always pick the best possible tests, and it tends to freeze technical details rather than business requirements. Such tests often break when the code is extended or modified, because they test implementation logic, not business logic. In practice this is not actually a problem -- you (or the LLM) just have to be very careful to ascertain whether a test failure is of a technical nature, and therefore simply requires updating the test, or whether something genuinely broke.

We generally start with *signature tests* on large objects: tests that simply verify which methods and properties exist on a class or module. They are cheap to write, cheap to run, and yes, they need frequent updating. However, they protect you from the LLM simply forgetting about some functionality when refactoring, which happens surprisingly often. And in interpreted languages like Python, you may only hit the resulting error on some obscure code path, leading to surprise failures that can be substantially delayed.
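
A signature test can be as small as the sketch below. `OrderService` and its methods are hypothetical stand-ins for whatever large object you want to pin down; the only real machinery is the built-in `dir()`:

```python
class OrderService:
    """Hypothetical stand-in for a large object whose surface we want to pin."""

    def create_order(self, customer_id, items): ...

    def cancel_order(self, order_id): ...

    def get_order(self, order_id): ...

    @property
    def open_order_count(self):
        return 0


def public_names(obj) -> set[str]:
    """Return every public attribute name (no leading underscore)."""
    return {name for name in dir(obj) if not name.startswith("_")}


# The frozen public surface. Update it deliberately when the API changes.
EXPECTED_ORDER_SERVICE_API = {
    "create_order",
    "cancel_order",
    "get_order",
    "open_order_count",
}


def test_order_service_signature():
    actual = public_names(OrderService)
    missing = EXPECTED_ORDER_SERVICE_API - actual
    unexpected = actual - EXPECTED_ORDER_SERVICE_API
    assert not missing, f"methods disappeared: {sorted(missing)}"
    assert not unexpected, f"unregistered new methods: {sorted(unexpected)}"
```

Checking in both directions is a design choice: the `missing` check catches functionality lost in a refactor, while the `unexpected` check forces every new public method to be registered on purpose.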

Test suites also need to be fast: a few minutes at most. Ideally they run constantly, or at least at every commit, so that the LLM does not go too far off-piste before being pulled back. If need be, split the slow tests into a separate suite that runs less often, eg only for PRs or releases.

Your test suite should also not live in the CI/CD pipeline, at least not exclusively. That is too slow, and there is no good feedback loop between CI/CD-hosted tests and the LLM. Of course, standard pre-release testing hygiene still applies _in addition to the LLM-focused testing framework_, so the main point here is that _you must have a quick-feedback-loop suite in place that the LLM can run autonomously_.
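
One way to give the LLM such a loop is a single local entry point that runs all the fast checks and stops at the first failure. The following is a minimal sketch, not a description of our actual tooling; the commands in `EXAMPLE_CHECKS` are assumptions (they presume `ruff` and `pytest` are installed and that a `slow` marker is registered), and `tools/structural_lint.py` is a hypothetical path:

```python
import subprocess
import sys


def run_checks(checks: list[tuple[str, list[str]]]) -> tuple[bool, str]:
    """Run each (name, command) pair in order and stop at the first failure."""
    for name, command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Return the tool output so the LLM sees exactly what to fix.
            return False, f"FAILED: {name}\n{result.stdout}{result.stderr}"
    return True, "all checks passed"


# Hypothetical check list, cheapest check first. Adapt it to your project.
EXAMPLE_CHECKS = [
    ("lint", [sys.executable, "-m", "ruff", "check", "."]),
    ("structural lint", [sys.executable, "tools/structural_lint.py"]),
    ("fast tests", [sys.executable, "-m", "pytest", "-q", "-m", "not slow"]),
]


def main() -> int:
    """Entry point: print the outcome and return a shell exit code."""
    ok, message = run_checks(EXAMPLE_CHECKS)
    print(message)
    return 0 if ok else 1
```

The agent can then be told to call this one entry point after every change and to keep going until it reports success.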

## Linting

**Linting proper.** Linting is the automated enforcement of code conventions — formatting, naming, import order, type annotations, line length. It's not glamorous, but it keeps the codebase consistent and predictable. The fewer surprises in how the code looks, the easier it is to read — for humans and for LLMs alike.

As alluded to above, code consistency matters for LLMs. A well-structured code base with clear naming conventions is easier for an LLM to navigate reliably. This helps avoid inconsistencies and stops the LLM from reinventing the wheel -- and more generally it pushes out the _chaos-boundary_, the point beyond which the code base is too complex for LLMs to work in.

**Structural linting.** Where linting enforces how code *looks*, structural linting enforces how code is *organized* — the rules about how modules relate to each other, what's allowed to call what, and where certain kinds of logic are permitted to live. A standard linter will tell you that your function name should be `snake_case`. A structural linter will tell you that your module isn't allowed to import from another module's internals, or that database access must go through the ORM, or that all agent definitions must follow a specific file naming and registration pattern. These aren't style preferences — they're architectural rules encoded as automated checks.
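
To give a flavour of what such a check looks like, here is a minimal sketch of an import-boundary rule built on Python's standard `ast` module. It assumes that every service is a top-level package and that shared code lives in its own package; the names `billing`, `shipping` and `shared` are invented for the example:

```python
import ast


def forbidden_imports(
    source: str, own_service: str, all_services: set[str]
) -> list[str]:
    """Return imported module names that reach into another service's package."""
    other_services = all_services - {own_service}
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z"; relative imports stay in-package.
            names = [node.module]
        else:
            continue
        hits.extend(n for n in names if n.split(".")[0] in other_services)
    return hits


# Usage: code belonging to the hypothetical "billing" service.
BILLING_SOURCE = """
import shared.utils
import billing.models
from shipping.internal import db
"""

violations = forbidden_imports(BILLING_SOURCE, "billing", {"billing", "shipping"})
```

Here `violations` contains `shipping.internal`, while the imports from `shared` and from the service's own package pass.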

For example, in our projects we enforce inter alia the following rules:

- **No magic numbers.** All numeric constants must live in centralized `constants.py` files. We make exceptions for obvious things like minutes-per-hour, but the linter flags anything else. This forces the LLM to centralize configuration rather than scattering hard-coded values through the code, where the same value will invariably end up being defined in multiple locations at once.

- **No cross imports.** We define multiple Python services in a single repo using some shared libraries. We must ensure that services cannot import from each other's internals, otherwise we suddenly have cross-calling issues that are hard to debug, and instantiation of ghost-objects that put pressure on resources, notably memory.

- **No SQL TEXT commands.** We are using an ORM and for a number of reasons we want to ensure that SQL is always generated by the ORM. LLMs sometimes ignore that and sneak in a `TEXT` command. Structural linting catches that. We also have a number of other database-specific rules, eg in relation to how sessions are injected, and to ensure that database queries are always async.

- **No Lazy Import.** Claude loves lazy import of Python dependencies. This has all kinds of issues for long-running services, eg increased memory pressure over time, or late failures. So our linter ensures they only happen where necessary eg to avoid circular imports.

- **No Silent Failures.** Claude loves soldiering on when things go wrong, and it creates fallbacks and fallbacks of fallbacks. Structural linting can avoid that -- we ensure that the code fails fast so corrective action can be taken where and when needed.
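
Several of these rules can be approximated with surprisingly little code. The sketch below is not our production linter; it is a simplified illustration using Python's standard `ast` module, covering three of the rules under stated assumptions: the set `ALLOWED_NUMBERS` is an arbitrary choice of "obvious" numbers, a lazy import is taken to mean any import inside a function body, and a silent failure is taken to mean an `except` block that contains nothing but `pass`. A real version would also need an allow-list for justified exceptions and would skip `constants.py` files.

```python
import ast

# Assumption: numbers considered too obvious to need a named constant.
ALLOWED_NUMBERS = {0, 1, 2, 60}


class StructuralLinter(ast.NodeVisitor):
    """Collect (line, rule) findings for three of the rules listed above."""

    def __init__(self) -> None:
        self.findings: list[tuple[int, str]] = []
        self._function_depth = 0

    def _visit_function(self, node: ast.AST) -> None:
        # Track nesting so imports inside functions can be recognised.
        self._function_depth += 1
        self.generic_visit(node)
        self._function_depth -= 1

    visit_FunctionDef = _visit_function
    visit_AsyncFunctionDef = _visit_function

    def _visit_import(self, node: ast.AST) -> None:
        if self._function_depth > 0:
            self.findings.append((node.lineno, "lazy-import"))

    visit_Import = _visit_import
    visit_ImportFrom = _visit_import

    def visit_Constant(self, node: ast.Constant) -> None:
        value = node.value
        # bool is a subclass of int, so exclude True/False explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value not in ALLOWED_NUMBERS:
            self.findings.append((node.lineno, "magic-number"))

    def visit_ExceptHandler(self, node: ast.ExceptHandler) -> None:
        # An except block that only contains "pass" swallows the error.
        if all(isinstance(stmt, ast.Pass) for stmt in node.body):
            self.findings.append((node.lineno, "silent-failure"))
        self.generic_visit(node)


def lint_source(source: str) -> list[tuple[int, str]]:
    """Parse the source and return all findings sorted by line number."""
    linter = StructuralLinter()
    linter.visit(ast.parse(source))
    return sorted(linter.findings)
```

Calling `lint_source` on a file's contents returns a list of `(line, rule)` pairs. Turning a non-empty list into a non-zero exit code is all it takes to make the rule part of the loop the LLM runs on its own.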

## Conclusion

Above we have provided a few examples of how we organise our projects to optimise AI-assisted coding. It is important to point out that this is optimised for our specific setup and operating model. Notably, we work with a just-see-how-this-feature-works-and-revert-if-need-be approach, ie we spend little time planning a feature in advance. We do not use BMAD internally, for example.

So the exact testing and linting framework will strongly depend on the project. What remains the same, however, is that those testing and linting guardrails must be very fast, and that the LLM must be able to run them independently, so that it can iterate in a loop until everything is at least self-consistent.
