Get the most value out of your tokens with nearshore software development using Claude

An agentic pipeline cut delivery from 41 days to 12. The 55 production issues automated checks missed are the part senior engineers still own.

By Lagarsoft Engineering

Lines of thought

Here’s a thing that happens when you put an agentic pipeline in front of real production work: it goes fast, and then you find out what fast was hiding. Whether you’re building in-house or via nearshore software development, the pattern holds.

Summary

An agentic pipeline can compress delivery dramatically. In our first implementation we went from 41 days to 12, and then to 5, using shared guardrails and cadence.

But the speed surfaced 55 operational issues that automated checks missed (IAM scoping, silent failures, OOM/Aurora limits, Step Functions constraints, and data bugs), all diagnosed by humans.

The pipeline improved only after humans named the failure modes and encoded checks for them, which is why senior review and agile ceremonies are non-negotiable.

Bottom line: speed is buyable. Judgment about what is safe to ship still requires experienced engineers.

How we rebuilt how we ship

We rebuilt how we ship. The pipeline runs the normal agile ceremonies (planning, review, retro), compressed to about a fifth of the calendar time. It’s built on the 12-factor agents principles, which amount to a long argument for treating an LLM like an unreliable junior who needs tight contracts around every step, not a genius you hand the keys to. We apply the same discipline across our nearshore development services so distributed teams share the same checks, cadence, and contracts.
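
One way to picture a tight contract around a step, as a minimal sketch: the model’s output is parsed and validated before anything acts on it. The `StepResult` fields and the `ALLOWED_ACTIONS` set are hypothetical, chosen for illustration. They are not taken from the 12-factor agents material or from a real pipeline schema.

```python
import json
from dataclasses import dataclass

# Hypothetical whitelist of actions a step may request (illustration only).
ALLOWED_ACTIONS = {"edit_file", "run_tests", "open_pr"}
REQUIRED_FIELDS = {"action", "target", "rationale"}


class ContractError(ValueError):
    """Raised when agent output does not satisfy the step contract."""


@dataclass(frozen=True)
class StepResult:
    action: str
    target: str
    rationale: str


def parse_step(raw: str) -> StepResult:
    """Parse raw model output and reject anything outside the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ContractError(f"missing fields: {sorted(missing)}")
    action = data["action"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ContractError(f"action not allowed: {action!r}")
    return StepResult(action, str(data["target"]), str(data["rationale"]))
```

The point of the design is that a malformed or out-of-scope step fails loudly at the boundary instead of being executed on trust.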

The first feature we ran through it had a pre-agent baseline of 41 calendar days. The agent version completed it in 12. That’s 3.4x.

The next one went 12 to 5.

The one after that held at 5 and stayed there.
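
The arithmetic behind those numbers, as a quick check. The cumulative figure assumes the features were of comparable size, which is an assumption on top of the day counts above.

```python
def speedup(before_days: float, after_days: float) -> float:
    """Ratio of calendar days before to calendar days after."""
    return before_days / after_days


print(round(speedup(41, 12), 1))  # 3.4  (first feature vs. baseline)
print(round(speedup(12, 5), 1))   # 2.4  (second feature vs. first)
print(round(speedup(41, 5), 1))   # 8.2  (cumulative, if features are comparable)
```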

We made a Gource video of one of these projects, before and after. The commits fly. It looks like the future.

Lagarsoft website v1 ~ 2019

Lagarsoft website v4 ~ 2026

It is not the future. It’s the easy part.

What the agent did, and what it couldn’t

On that first 41-to-12 graduation, the agent executed about 70% of the work. The rest was human-guided or human-diagnosed. Good ratio. The automated checks did their job too: ruff, pyright, pip-audit, and terraform fmt ran on every loop, catching lint, type, formatting, and known-vulnerability issues before a person ever looked.
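
A minimal sketch of what running those four tools on every loop could look like. The tools are the ones named above. The exact flags, the `run_checks` helper, and the injectable `runner` are assumptions for illustration.

```python
import subprocess
from typing import Callable, Sequence

# The four tools are the ones named above; the exact flags are an assumption.
CHECKS: list[list[str]] = [
    ["ruff", "check", "."],
    ["pyright"],
    ["pip-audit"],
    ["terraform", "fmt", "-check", "-recursive"],
]


def run_checks(
    checks: Sequence[Sequence[str]] = CHECKS,
    runner: Callable = subprocess.run,
) -> list[str]:
    """Run every check and return the command lines that failed."""
    failed = []
    for cmd in checks:
        try:
            result = runner(list(cmd), capture_output=True, text=True)
            ok = result.returncode == 0
        except FileNotFoundError:
            ok = False  # a missing tool counts as a failed check
        if not ok:
            failed.append(" ".join(cmd))
    return failed
```

An empty list from `run_checks()` means the loop may continue. It says nothing about IAM scope, memory limits, or data loss, which is the gap the rest of this post is about.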

Now the number that matters.

That graduation closed with 55 operational fixes. The automated checks caught zero of them. All 55 were diagnosed by a person.

Here’s the breakdown, because the categories are the point:

  • 12 were IAM roles scoped too wide.
  • 7 were out-of-memory and Aurora issues.
  • 5 were silent failures, the kind that return 200 and lose your data.
  • 5 were naming gaps in the graduation itself.
  • 4 were Step Functions constraints.
  • 5 were SQL and data-logic bugs.
  • The rest was module wiring and config drift.
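
Tallying the list makes the size of "the rest" explicit. The counts are the ones above. The dictionary keys are just labels.

```python
TOTAL_FIXES = 55

named = {
    "IAM roles scoped too wide": 12,
    "out-of-memory and Aurora": 7,
    "silent failures": 5,
    "naming gaps": 5,
    "Step Functions constraints": 4,
    "SQL and data-logic bugs": 5,
}

named_total = sum(named.values())
rest = TOTAL_FIXES - named_total  # module wiring and config drift

print(named_total)  # 38
print(rest)         # 17
```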

Read that list again if you run production. None of it is the kind of thing a linter finds. All of it is the kind of thing that pages you at 2am four months after the demo went well.
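
To make the silent-failure category concrete, here is a sketch of the kind of check a linter cannot write for you: a batch write that answers 200 but did not store everything. The response shape (`status_code`, `written_count`, `failed_count`) is hypothetical and not any specific service’s API.

```python
class PartialWriteError(RuntimeError):
    """Raised when a 'successful' batch response still lost records."""


def assert_batch_fully_written(response: dict, sent: int) -> None:
    """Fail loudly unless every record that was sent was also written.

    `response` uses a hypothetical shape for illustration only.
    """
    status = response.get("status_code")
    if status != 200:
        raise PartialWriteError(f"unexpected status: {status}")
    failed = response.get("failed_count", 0)
    written = response.get("written_count", 0)
    if failed or written != sent:
        raise PartialWriteError(
            f"sent {sent}, wrote {written}, failed {failed} (status was 200)"
        )
```

The code that skips this check is perfectly valid Python and passes every static tool. The bug is in what it does not ask.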

The agent wrote the code fast. A senior engineer is the reason the fast code didn’t quietly leak credentials or drop rows.

The pipeline got better at catching itself. After we taught it what to catch.

The honest follow-up: it improved. By the later graduations the agent was catching most of its own issues in-loop, the higher-severity ones especially.

On one delta we logged 4 issues caught by the agent and 0 slipped.

On the big regression sweep, 23 caught, 1 slipped.
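
Expressed as catch rates, using only the counts above:

```python
def catch_rate(caught: int, slipped: int) -> float:
    """Share of issues caught in-loop out of all issues found."""
    return caught / (caught + slipped)


print(f"{catch_rate(0, 55):.1%}")  # 0.0%   first graduation, automated checks
print(f"{catch_rate(4, 0):.1%}")   # 100.0% later delta
print(f"{catch_rate(23, 1):.1%}")  # 95.8%  regression sweep
```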

That looks like the agent getting smart. It’s not. It’s the agent getting good at checking for the failure modes a person already named: the silent partial failures, the IAM path scoping, the result-checking states in Step Functions. Somebody had to get burned by each of those first and write the check. The pipeline compounds human judgment. It doesn’t replace it. That improvement loop is exactly what we push in nearshore application development, where multiple teams benefit from shared checklists and pre-flight tests.
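
As an illustration of a named failure mode turned into a check, here is a simple heuristic that flags overly broad `Allow` statements in an IAM policy document. It is a wildcard check, which is simpler than the path scoping mentioned above, and the function names are hypothetical.

```python
def _as_list(value) -> list:
    """IAM allows a single string or a list for Action and Resource."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wide_statements(policy: dict) -> list[dict]:
    """Return Allow statements that use a wildcard action or resource.

    Heuristic only: flags Action "*" or "service:*", and Resource "*".
    """
    wide = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        wide_action = any(a == "*" or a.endswith(":*") for a in actions)
        wide_resource = "*" in resources
        if wide_action or wide_resource:
            wide.append(stmt)
    return wide
```

A check like this only exists because someone first decided that wide roles were a problem worth naming. Run in-loop, it lets the agent catch the next one itself.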

This is also why we kept the ceremonies instead of deleting them to go faster. The planning and review steps are where a senior looks at what the agent is about to do and says no. Take those out and you’ve built a very fast way to ship the wrong thing. It’s the same safeguard we recommend to our clients running parallel tracks.

Why this is the whole argument

There’s a version of software right now where a non-technical founder builds something in Cursor over a weekend, it demos beautifully, and it dies the first Friday real users touch it. We see it constantly. It’s most of the remediation work we do. It shows up in startups and in nearshore development company engagements alike.

The reflex is to blame the tool. The tool is fine. The 70% the agent executed on our pipeline was genuinely good code. The problem was never that the AI writes badly. The problem is that AI writes confidently, and confidence at 5x speed means you reach production faster, with more under the surface, and nobody senior decided whether any of it should ship.

So when we tell you we cut 41 days to 12, that’s true, and it’s the least important sentence in this post. We’d rather you remember the 55 fixes. Speed is buyable now. Anyone can buy it. The judgment about which fast code is safe to put in front of real load is the part that still has a person’s name on it. It’s why our nearshore software development services emphasize senior judgment.

That’s what we sell. Not the pipeline. What the pipeline can’t do alone.

Q&A

What do you mean by an “agentic pipeline,” and how did it cut delivery from 41 days to 12 (and then to 5)?

Short answer: it’s a development flow where an LLM acts like a tightly managed engineer inside compressed agile ceremonies (planning, review, retro), running fast loops with strict contracts and guardrails. Built on “12-factor agents” principles, it treats the model as unreliable without checks, not as an autonomous expert. Paired with nearshore teams sharing the same cadence and controls, the first feature went from a 41-day baseline to 12 days (3.4x), with subsequent features stabilizing at about 5 days.

If automated checks were running, why did 55 operational issues slip through?

Short answer: static checks (ruff for linting, pyright for types, pip-audit for known-vulnerable dependencies, terraform fmt for formatting) catch code hygiene and obvious risks, not production failure modes. The 55 fixes were things like overly broad IAM roles (12), out-of-memory/Aurora limits (7), silent 200-OK data losses (5), naming gaps in graduation (5), Step Functions constraints (4), SQL/data-logic bugs (5), plus wiring/config drift. These are the kinds of issues that page you months later and typically require human context to detect.

Did the agent actually “get smarter” over time at catching its own issues?

Short answer: not on its own. It improved after humans named the failure modes and encoded targeted checks. Later runs showed 4 issues caught with 0 slipping on one delta, and 23 caught with 1 slipping on a larger sweep, because the pipeline learned the checks people wrote (e.g., for silent partial failures, IAM path scoping, Step Functions result checks). The system compounds human judgment. It doesn’t replace it.

Why keep senior review and agile ceremonies if speed is the goal?

Short answer: because speed without judgment ships the wrong thing faster. Planning and review are where a senior engineer vetoes unsafe or misdirected work before it moves. Removing those checkpoints turns an efficient pipeline into a risk amplifier. This is doubly important in nearshore setups running parallel tracks, where shared guardrails and senior sign-off prevent fast, synchronized mistakes.

What’s the main takeaway for teams considering nearshore software development with agents?

Short answer: you can buy speed. You can’t buy judgment. Expect the same operational categories to surface until you name and guardrail them. Standardize checks and pre-flight tests across teams, treat the LLM like a junior with strict contracts, and make senior review non-negotiable. The value isn’t just the pipeline. It’s the experience deciding what’s actually safe to ship. If you want to see what that looks like on your codebase, book a call.
