Everyone has opinions about AI development productivity. Few have data. We tracked our sprint metrics for 6 months — 3 months before structured AI adoption and 3 months after — across 8 projects and 6 engineers. Here are the raw numbers and what they mean.
Methodology
What we measured
- Story points delivered per two-week sprint (team velocity)
- Time per task by category (tracked via Toggl)
- Code review cycles per pull request
- Bug density per 1,000 lines of code (bugs found in QA and production; see the sketch after this list)
- Token spend per sprint (AI API costs)
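
Both derived metrics reduce to simple arithmetic. A minimal sketch, with function and variable names that are ours for illustration only:

```python
def bugs_per_kloc(bug_count: int, lines_of_code: int) -> float:
    """Bug density normalized to 1,000 lines of code."""
    return bug_count / lines_of_code * 1000

def cost_per_story_point(token_spend_usd: float, story_points: float) -> float:
    """AI API cost amortized over delivered story points."""
    return token_spend_usd / story_points

# e.g. cost_per_story_point(298, 64.7) -> ~4.61 (Post-AI Month 3, tables below)
```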
How we categorized tasks
- Implementation: Writing new features, CRUD operations, UI components, API endpoints
- Architecture: System design, database schema design, infrastructure decisions
- Debugging: Bug investigation, root cause analysis, fixes
- Testing: Writing and maintaining tests
- Documentation: API docs, README files, technical specs
- Code review: Reviewing pull requests from team members
Controls
- Same team composition across both periods (6 engineers: 2 senior, 3 mid, 1 junior)
- Similar project complexity (scored 1–5 by the tech lead before each sprint)
- Same clients and domains
- Sprint capacity held constant (no overtime in either period)
Tools used (after adoption)
- GitHub Copilot for inline completion
- Claude (via API) for code generation from specs (sketched below)
- Custom ADLC agents for structured generation
- GPT-4o for code review assistance
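
The agent code itself is out of scope here, but the spec-to-code path looks roughly like this: a standardized project brief is loaded once and sent as the system prompt, and the spec goes in as the user message. A minimal sketch assuming the anthropic Python SDK (the model name and file path are illustrative, not our actual configuration):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Standardized project context, loaded once per agent rather than retyped per prompt
PROJECT_CONTEXT = open("context/project_brief.md").read()

def generate_from_spec(spec: str) -> str:
    """First-draft implementation from a written spec; output still goes through review."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative; use whatever model you standardize on
        max_tokens=4096,
        system=PROJECT_CONTEXT,
        messages=[{"role": "user", "content": f"Implement this spec:\n\n{spec}"}],
    )
    return response.content[0].text
```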
The Numbers
Overall velocity
| Period | Avg Story Points/Sprint | Change |
|---|---|---|
| Pre-AI (Month 1–3) | 47.2 | Baseline |
| Post-AI Month 1 | 52.8 | +12% |
| Post-AI Month 2 | 61.4 | +30% |
| Post-AI Month 3 | 64.7 | +37% |
The ramp-up is significant. Month 1 was largely learning — figuring out where AI helps, developing prompt patterns, and adjusting review processes. Gains accelerated as the team built muscle memory.
Time per story point by task category
| Task Category | Pre-AI (hrs/point) | Post-AI (hrs/point) | Change |
|---|---|---|---|
| Implementation | 3.2 | 1.5 | -53% |
| Testing | 2.8 | 1.6 | -43% |
| Documentation | 1.5 | 0.6 | -60% |
| Code Review | 1.8 | 1.9 | +6% (slower) |
| Debugging | 2.4 | 2.3 | -4% (negligible) |
| Architecture | 3.5 | 3.4 | -3% (negligible) |
The pattern is clear: AI dramatically speeds up generation tasks (implementation, testing, documentation) but provides no meaningful help with judgment tasks (debugging, architecture). Code review actually got slightly slower because reviewers needed to scrutinize AI-generated code more carefully.
Code review cycles
| Period | Avg Review Cycles per PR |
|---|---|
| Pre-AI | 3.1 |
| Post-AI Month 1 | 2.8 |
| Post-AI Month 2 | 1.4 |
| Post-AI Month 3 | 1.1 |
This was the largest unexpected gain. Once we implemented the Review Agent (ADLC Agent 5), surface-level issues were caught before human review. Human reviewers focused on logic and architecture, which meant fewer "fix the formatting" back-and-forths.
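
The Review Agent's internals are out of scope for this post, but the pattern is simple: run a model pass over the diff before any human is assigned, and bounce the PR back if it flags anything. A rough sketch assuming the openai Python SDK, since GPT-4o handles our review assistance (the prompt and gating policy here are illustrative):

```python
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def surface_review(base_branch: str = "main") -> str:
    """First-pass review of the current branch's diff, run before human review."""
    diff = subprocess.run(
        ["git", "diff", base_branch],
        capture_output=True, text=True, check=True,
    ).stdout
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Flag surface-level issues only: formatting, naming, "
                "unused code, obvious typos. Reply PASS if none."
            )},
            {"role": "user", "content": diff},
        ],
    )
    return response.choices[0].message.content
```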
Bug density
| Period | Bugs per 1,000 LOC (QA) | Bugs per 1,000 LOC (Production) |
|---|---|---|
| Pre-AI | 4.2 | 0.8 |
| Post-AI Month 1 | 5.1 | 1.2 |
| Post-AI Month 2 | 3.8 | 0.7 |
| Post-AI Month 3 | 3.1 | 0.5 |
Month 1 saw increased bugs — expected when introducing any new tool. Engineers were accepting AI output without sufficient review. After tightening the review process, bug density dropped below pre-AI levels. The Review Agent catches pattern-based bugs that humans sometimes miss.
Token spend
| Month | Total Token Spend | Cost per Story Point |
|---|---|---|
| Post-AI Month 1 | $847 | $16.03 |
| Post-AI Month 2 | $412 | $6.71 |
| Post-AI Month 3 | $298 | $4.61 |
Token spend dropped 65% from Month 1 to Month 3. The driver: structured prompting through ADLC eliminated redundant context-loading. Instead of each engineer explaining the project from scratch in every prompt, our agents pre-load standardized context once.
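
One way to get this effect with the Anthropic API is prompt caching: mark the standardized system block as cacheable so subsequent calls re-bill it at the cached rate rather than full input price. A minimal sketch (this is one possible mechanism, not a spec of our agent internals; model name and file path are illustrative):

```python
import anthropic

client = anthropic.Anthropic()
PROJECT_CONTEXT = open("context/project_brief.md").read()  # illustrative path

def run_task(task_prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative
        max_tokens=4096,
        # The standardized brief is marked cacheable, so every engineer's prompt
        # reuses it instead of paying to re-send project context from scratch.
        system=[{
            "type": "text",
            "text": PROJECT_CONTEXT,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": task_prompt}],
    )
    return response.content[0].text
```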
What Improved and Why
Implementation tasks: 53% faster
The biggest gains came from:
- Boilerplate generation: API endpoints, database migrations, type definitions, form components (one sketched after this list)
- Pattern repetition: When the codebase has an established pattern, AI replicates it accurately
- Test data: Generating fixtures, factories, and mock data
- Initial implementation: First draft of a feature generated in minutes, then refined in review
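
To make "boilerplate" concrete, this is the shape of thing the tools produce reliably once a pattern exists in the codebase. A hypothetical example; the framework and names are illustrative, not from a client project:

```python
# Hypothetical CRUD endpoint of the kind AI generates well from an existing pattern.
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter(prefix="/projects")

class ProjectIn(BaseModel):
    name: str
    client_id: int

class ProjectOut(ProjectIn):
    id: int

_DB: dict[int, ProjectOut] = {}  # stand-in for the real persistence layer

@router.post("", response_model=ProjectOut)
def create_project(payload: ProjectIn) -> ProjectOut:
    project = ProjectOut(id=len(_DB) + 1, **payload.model_dump())
    _DB[project.id] = project
    return project

@router.get("/{project_id}", response_model=ProjectOut)
def get_project(project_id: int) -> ProjectOut:
    if project_id not in _DB:
        raise HTTPException(status_code=404, detail="Project not found")
    return _DB[project_id]
```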
Testing: 43% faster
- Test scaffolding generated from specs (what to test, not just how)
- Edge case generation: AI is good at identifying boundary conditions (example after this list)
- Mock and fixture generation
- Integration test boilerplate
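
For example, given a one-line spec like "discounts are percentages from 0 to 100", the tools reliably produce boundary tests of this shape (hypothetical validator and cases):

```python
# Hypothetical example of AI-generated boundary tests for a simple validator.
import pytest

def is_valid_discount(pct: float) -> bool:
    return 0 <= pct <= 100

@pytest.mark.parametrize("pct, expected", [
    (-0.01, False),   # just below the lower bound
    (0, True),        # lower bound
    (50, True),       # interior value
    (100, True),      # upper bound
    (100.01, False),  # just above the upper bound
])
def test_discount_boundaries(pct, expected):
    assert is_valid_discount(pct) == expected
```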
Documentation: 60% faster
- API documentation generated from actual endpoints
- Code comments for complex functions
- README sections
- Changelog entries from git history (sketched below)
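
The changelog flow is the simplest of these: pipe recent commit subjects into the model and ask for grouped entries. A sketch; the tag range, model, and prompt are illustrative:

```python
# Draft changelog entries from commit subjects since the last release tag.
import subprocess
import anthropic

log = subprocess.run(
    ["git", "log", "--oneline", "--no-merges", "v1.4.0..HEAD"],  # illustrative tag
    capture_output=True, text=True, check=True,
).stdout

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Draft changelog entries grouped by Added/Changed/Fixed:\n\n{log}",
    }],
)
print(response.content[0].text)
```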
What Did Not Improve
Debugging: 4% faster (negligible)
We tried using AI for debugging. Results were poor. Debugging requires:
- Understanding of the specific system's behavior
- Ability to reproduce and isolate issues
- Knowledge of recent changes and their side effects
- Intuition built from experience with the codebase
AI tools suggested plausible-sounding fixes that were wrong 70% of the time for non-trivial bugs. Engineers wasted time evaluating AI suggestions instead of applying their own debugging methodology.
Conclusion: Do not use AI for debugging complex issues. Use it only for obvious pattern-based bugs (null checks, off-by-one errors, missing awaits).
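
The missing await is the canonical case: catching it is pure pattern matching, with no system knowledge required. A contrived illustration:

```python
# Contrived example of a pattern-based bug AI flags reliably.
import asyncio

async def fetch_user(user_id: int) -> dict:
    await asyncio.sleep(0.1)  # stands in for a real I/O call
    return {"id": user_id, "name": "example"}

async def handler(user_id: int):
    user = fetch_user(user_id)  # BUG: missing await; `user` is a coroutine
    return user["name"]         # TypeError: 'coroutine' object is not subscriptable

async def handler_fixed(user_id: int):
    user = await fetch_user(user_id)
    return user["name"]
```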
Architecture: 3% faster (negligible)
AI cannot make architecture decisions for your specific context. It does not know your scale requirements, team capabilities, deployment constraints, or business priorities. We stopped trying to use AI for architecture after Month 1.
Code review: 6% slower
Reviewing AI-generated code takes slightly more time than reviewing human-written code. Why:
- AI-generated code can be subtly wrong in ways that look correct
- Pattern matches might miss context-specific requirements
- Reviewers cannot rely on the "I know this engineer's patterns" heuristic
We offset this by having the Review Agent catch surface issues first, so human review time is spent on substance rather than style.
Lessons Learned
Lesson 1: The gains compound over time
Month 1 showed modest improvement (12%). Month 3 showed significant improvement (37%). The difference is not better tools — it is better processes around the tools. Teams need time to develop AI-augmented workflows.
Lesson 2: Trust calibration is critical
In Month 1, engineers alternated between over-trusting AI (accepting output without review) and under-trusting it (rewriting everything). By Month 3, they had calibrated — they knew which tasks to trust AI on and which required heavy verification.
Lesson 3: Junior engineers need more structure
Our junior engineer showed the smallest gains initially and the largest increase in introduced bugs. Without experience to evaluate AI output, they accepted incorrect code more often. We added mandatory senior review for all AI-assisted work from junior team members.
Lesson 4: The cost curve inverts quickly
Month 1 token spend: $847. Month 3: $298. Structured AI use is dramatically cheaper than ad-hoc use. The ADLC framework's standardized context loading was the primary driver.
Lesson 5: Measure by task type, not overall
"37% faster" is the headline number, but it obscures the reality: some tasks are 60% faster and some are unchanged. Knowing which is which lets you allocate AI assistance where it actually helps.
What This Means for Your Team
If you are considering structured AI adoption:
- Expect a 1–2 month ramp-up before seeing significant gains
- Budget for initial quality dips — have stronger review processes ready
- Track metrics by task category — overall velocity hides the real picture
- Invest in process, not just tools — the framework around AI matters more than which AI model you use
- Start with implementation and testing — highest ROI, lowest risk
The 37% number is real, but it took three months of intentional process development to achieve. Teams expecting instant transformation will be disappointed. Teams willing to invest in structured AI workflows will see compound returns.