AI-First Coding: Closing the Gap Between Skeptics and Practitioners in Dev Teams

Speaker: Evgeny Potapov, ApexData co-founder & CEO

I started my last company in 2008. We were two people. By 2022 we were 150. The thing about scaling like that is that until recently, doing more development meant hiring more developers. I'm half a manager and half an engineer — and at 150 people you don't code anymore. You troubleshoot people problems.

When I started ApexData, I didn't want 150 people again. I wanted a small team that could stay well-connected, have fun, and let me keep doing tech work. So we started coding, and I quickly saw that AI was a real leverage point for productivity per person — instead of hiring more and more developers, you scale the people you already have.

We're trying to build this company AI-first. I try to help my friends. We run a small Slack group where we share what we learn. And one of the biggest problems I see in many companies right now is this: in a team of 10, maybe 30% are doing daily AI coding, and the rest have either tried it once, been burned, and given up, or never started. This talk is about closing that gap.

AI tools in the development process — 84% of respondents are using or planning to use AI tools

A lot of people now think that 90–100% of software developers use AI tools every day. The actual numbers from Stack Overflow's late-2025 developer survey are different.

Stack Overflow survey: AI tool usage by frequency

Only 47% use AI tools daily. 17% use them weekly, 13% monthly, and 16% don't use them and don't plan to. So roughly half of working developers (17 + 13 + 16 = 46%) are either occasional users or absent.

Why? In most cases the reason is the same: they tried AI coding in late 2022, or in 2023, or in early 2024 — got hallucinations, got burned, and never came back. But 2026 is a different world. Before talking about how to onboard those people, I want to walk through how we actually got here.

Table of contents: What's different now, Is AI coding reliable, Concerns, How to share, How not to share

Here's the plan. First, what changed and why. Second, whether AI coding is actually reliable now. Third, the concerns people raise in practice. Fourth, how to share AI-coding adoption inside a team — and how not to do it.

What's different now

Three years of practical work, four distinct eras.

Evolution of Transformer Models: 2017 Attention Is All You Need, 2018 GPT, 2019 GPT-2, 2020 GPT-3, 2021 Codex and GitHub Copilot

Everything started with a single paper, "Attention Is All You Need," in 2017. OpenAI built GPT on top of it as an API, then figured out by 2021 that the same architecture, fine-tuned on code, was a business. They released Codex, gave the API to GitHub, and GitHub shipped Copilot.

Stone Age: GitHub Copilot announced — June 2021

Guillermo Rauch on GitHub Copilot: 'GitHub Copilot is a code synthesizer, not a search engine'

Guillermo Rauch, the founder of Vercel, called Copilot "the future" on the day it launched. At the time it felt like magic — code that auto-completes itself. Looking back, that was the Stone Age.

Bronze Age: ChatGPT announced — November 2022

Early ChatGPT response identifying a never-closed Go channel bug

November 2022, OpenAI shipped ChatGPT. One of the headline use cases was bug-finding: paste 10 lines of code, ask why it isn't working, get a fix. People were stunned. People were also exhausted by the hype, and most of the time the answers were unreliable — too many hallucinations, not enough reasoning.

Salvatore Sanfilippo (antirez) tweet from Dec 2023 on LLM failure at system programming

By the end of 2023, Salvatore Sanfilippo — the creator of Redis — wrote that the level of LLM failure at systems programming was incredible, and that systems work probably needed advanced reasoning that the models just didn't have. Remember this quote. It comes back later.

Modern Era: Reasoning Models — Claude 3.5 Sonnet, June 2024

June 2024, Anthropic shipped Claude 3.5 Sonnet — the first reasoning model that landed in coding tools. Reasoning is the simple idea that changes everything.

Pre-reasoning model: prompt goes straight to output, mistakes included

Pre-reasoning, the LLM took your prompt and started predicting the next token immediately. For natural text that's fine. For code, you get a regex that's almost right and quietly wrong — missing characters, broken patterns, the whole problem.

Reasoning inverts the flow. The model first thinks about what you asked, plans the steps, then writes the code. Hallucinations don't disappear — they probably never will — but a reasoning loop catches a lot of them before they hit you. One month after Claude 3.5 Sonnet, Cursor adopted it, and that's the moment most people heard of Cursor.

Postmodern Era: Agents are models using tools in a loop — © Anthropic

Then came the agents. Anthropic's definition is the cleanest one: "Agents are models using tools in a loop." Reasoning is step one; tools — file reads, edits, shell commands, test runs — are step two.

Postmodern Era: Claude Code — Feb 24, 2025

February 2025: Claude Code. Now the developer types a task, not a snippet. Let's walk through what happens.

Agentic Mode: User prompt and understanding & planning

Agentic Mode: Tool call — explore codebase using list_directory and read_file

Agentic Mode: Reasoning and decision based on tool results

Agentic Mode: Final summary of completed task and modified files

"Add input validation to all API endpoints." The agent plans the work, calls list_directory to find the endpoints, reads the files, decides how to extend the existing validator, edits the code, runs the tests, and reports back. That's what makes 2026 different from 2024. The 2024 experience was "type a snippet, get a snippet back, hope it compiles." The 2026 experience is "describe a task, the agent does the task."

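The walkthrough above, a model calling tools in a loop until the task is done, can be sketched in a few lines of Python. This is a minimal illustration, not how Claude Code is implemented: `scripted_model` is a hypothetical stand-in for a real LLM call, the in-memory `FILES` dict stands in for a repository, and `edit_file` is an assumed tool name (only `list_directory` and `read_file` appear on the slides).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# In-memory stand-in for a repository, so the sketch runs anywhere.
FILES = {
    "api/orders.py": "def create_order(payload): ...",
    "api/users.py": "def create_user(payload): ...",
}


def list_directory(path: str) -> list[str]:
    """Tool: list file paths under a directory."""
    prefix = path.rstrip("/") + "/"
    return sorted(p for p in FILES if p.startswith(prefix))


def read_file(path: str) -> str:
    """Tool: return the contents of one file."""
    return FILES[path]


def edit_file(path: str, content: str) -> str:
    """Tool (assumed name): overwrite one file."""
    FILES[path] = content
    return f"wrote {path}"


TOOLS = {"list_directory": list_directory, "read_file": read_file, "edit_file": edit_file}


@dataclass
class Action:
    tool: Optional[str]  # None means the model considers the task finished
    args: dict = field(default_factory=dict)
    summary: str = ""


def scripted_model(task: str, history: list) -> Action:
    """Hypothetical stand-in for an LLM: picks the next action from the history."""
    listings = [result for action, result in history if action.tool == "list_directory"]
    if not listings:
        return Action("list_directory", {"path": "api"})
    contents = {a.args["path"]: r for a, r in history if a.tool == "read_file"}
    edited = {a.args["path"] for a, _ in history if a.tool == "edit_file"}
    for path in listings[0]:
        if path in edited:
            continue
        if path not in contents:
            return Action("read_file", {"path": path})
        # "# validated" is a placeholder for a real edit that adds validation.
        return Action("edit_file", {"path": path, "content": "# validated\n" + contents[path]})
    return Action(None, summary=f"Added validation to {len(edited)} files")


def run_agent(task: str, model=scripted_model, max_steps: int = 20) -> str:
    """The loop itself: ask the model, run the tool, feed the result back."""
    history = []
    for _ in range(max_steps):
        action = model(task, history)
        if action.tool is None:
            return action.summary
        result = TOOLS[action.tool](**action.args)
        history.append((action, result))
    raise RuntimeError("step budget exhausted before the task finished")


if __name__ == "__main__":
    print(run_agent("Add input validation to all API endpoints"))
```

The step budget is the one design choice worth noting: a loop that feeds its own output back in needs a hard stop, however capable the model behind it is.
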
What concerns do people have?

So, if AI coding is in such a different place now, why aren't all 100% of developers on board? Three concerns come up over and over.

Three concerns: lack of trust, security concerns, coding is not fun anymore

Lack of trust. Security. And — the one that's most personal and hardest to solve — coding isn't fun anymore.

Is AI coding that reliable? A lot of people got their AI coding experience in 2024 and didn't like it

Is AI coding reliable now?

The trust problem is mostly an outdated impression. Most skeptics formed their opinion in 2024 or early 2025, on models that were genuinely not good enough. The right answer is to look at benchmarks over time.

SWE-bench progress over time: GPT-4 baseline 2%, GPT-4o 21.6%, Claude 4.5 Opus 74.4%

On SWE-bench, GPT-4o sat around 21% task resolution. Claude 4.5 Opus, Gemini 3 Pro, and GPT-5 / Claude 4 Opus are all in the 66–74% range now. That's not a 10% improvement — it's roughly 3–4x. So when someone says "I tried it, it was bad," the right follow-up is: which model, and when?
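
The arithmetic behind that 3-4x figure is easy to check. The short Python sketch below separates the gain in percentage points from the ratio; the 21.6% and 74.4% values come from the slide, and 66% is the low end of the range quoted above.

```python
# SWE-bench resolution rates quoted in the talk, in percent.
GPT_4O = 21.6
FRONTIER_LOW, FRONTIER_HIGH = 66.0, 74.4


def compare(old: float, new: float) -> tuple[float, float]:
    """Return (gain in percentage points, ratio new/old), rounded to one decimal."""
    return round(new - old, 1), round(new / old, 1)


for label, score in [("low end", FRONTIER_LOW), ("high end", FRONTIER_HIGH)]:
    points, ratio = compare(GPT_4O, score)
    print(f"{label}: +{points} points, {ratio}x the GPT-4o rate")
```

On these numbers the ratio comes out at about 3.1x to 3.4x, which is the sense in which the newer models are a different era and not an incremental step.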

LiveCodeBench progress: GPT-4o / Claude 3.5 ~35%, o4-mini / DeepSeek-R1 ~80%

Same story on LiveCodeBench. ~35% → ~55% → ~70% → ~80%. The previous generation is a previous era. Anyone whose opinion is based on tools from a year ago needs to retry.

Salvatore Sanfilippo's 2023 tweet about LLMs being almost useless at system programming

Antirez 2026 update: 'for most projects, writing the code yourself is no longer sensible'

Here's the same Sanfilippo I quoted earlier, the man who in December 2023 called the level of LLM failure at systems programming incredible. In a recent post, working on Redis-level code, he wrote that for most projects, writing the code yourself is no longer sensible. Redis is low-level software. If reasoning agents work for Redis, they work for almost anything. That's the kind of update I show to skeptics.

Adam Wathan (Tailwind): 'I used to be more skeptical than I am now... it's still not faster to do it yourself'

Adam Wathan, the author of Tailwind CSS, is another good example. He said publicly that he used to be more skeptical, that he assumed it would be faster to write the code himself, and that he was wrong: agent mode actually lets him program more, not less. Find a well-known engineer whose past skepticism matches your colleague's, then show your colleague that person changing their mind.

LLMs are still hallucinating and will always be, but there are solutions

None of this means the hallucination problem is solved. It's not, and probably never will be. But there are now layers of mitigation that didn't exist a year ago.

Claude Code Skills / Superpowers plugin — github.com/obra/superpowers

Superpowers skill: receiving-code-review SKILL.md showing verify-before-implementing pattern

Detailed Code Review Reception skill content

Claude Code recently added skills — prepared prompts you load on demand. The open-source "Superpowers" plugin is a great example. It's a library of skills for brainstorming, planning, executing, security review, code review, and more. Each skill is a long, carefully written markdown file: guardrails that tell the model how to check itself, what to verify, what not to skip.

Development workflow: planning → design → TDD setup → develop task → run tests → spec review → feedback loop → code review → next task

With skills loaded, the flow is no longer "ask, get code, hope." It's plan → design → TDD setup → implement → run tests → spec review → code review → next task. If tests fail, loop back. If review fails, loop back. In 60–70% of my runs the loop goes back at least once. The end result, after all those re-checks, is dramatically better than the equivalent single-shot output.
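
The loop-back behaviour described above can be sketched as a small Python driver. This illustrates the control flow only, not the Superpowers implementation: `implement`, `run_tests`, and `code_review` are hypothetical stand-ins for the agent and its checks, and the port-parsing task is made up for the example.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[str], Tuple[bool, str]]]


def develop_task(implement: Callable[[str], str], checks: List[Check], max_rounds: int = 5):
    """Re-run implement() with the failing check's feedback until every check passes."""
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        draft = implement(feedback)
        for name, check in checks:
            ok, note = check(draft)
            if not ok:
                feedback = f"{name}: {note}"  # loop back with a concrete reason
                break
        else:
            return draft, round_no  # no check failed
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")


def implement(feedback: str) -> str:
    """Hypothetical agent: writes a naive draft, then fixes it when told about ranges."""
    if "range" in feedback:
        return (
            "def parse_port(s):\n"
            "    port = int(s)\n"
            "    if not 0 < port < 65536:\n"
            "        raise ValueError(s)\n"
            "    return port\n"
        )
    return "def parse_port(s):\n    return int(s)\n"


def run_tests(code: str) -> Tuple[bool, str]:
    """Stand-in test suite: an out-of-range port must be rejected."""
    namespace: dict = {}
    exec(code, namespace)  # acceptable for a toy sketch; never exec untrusted code
    try:
        namespace["parse_port"]("70000")
    except ValueError:
        return True, ""
    return False, "out-of-range port accepted; add a range check"


def code_review(code: str) -> Tuple[bool, str]:
    """Stand-in review: invalid input must raise rather than pass through."""
    return "raise ValueError" in code, "invalid input must raise ValueError"


final_code, rounds = develop_task(
    implement, [("run tests", run_tests), ("code review", code_review)]
)
print(f"passed after {rounds} round(s)")
```

In this toy run the first draft fails the test stage, the failure text is fed back, and the second draft passes both checks. The single-shot output would have been the first draft.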

JustHTML — BeautifulSoup replacement with 100% coverage on 9,200 browser HTML tests

Concrete example: someone built JustHTML, a pure-Python HTML parser, by feeding the 9,200 browser HTML rendering tests to an agent and letting it iterate for about a week. It now hits 100% on those tests. Chromium is at 99%, WebKit at 98%, Firefox at 97%, BeautifulSoup at 4%. That kind of result is only possible because of the verify-then-fix loop.

Security concerns

Security comes in two flavours: who's training on your code, and whether the code the AI writes is safe.

OpenAI Enterprise Plan compliance: ownership, control, security guarantees

For training and compliance, OpenAI's enterprise plan guarantees your data isn't used for training by default, gives you data-retention controls, regional hosting, and SAML SSO.

Anthropic compliance: Claude for Work Enterprise, AWS Bedrock, Google Vertex AI

Anthropic has Claude for Work Enterprise with the same kind of controls, and you can also consume Anthropic models via AWS Bedrock or Google Vertex AI — both give you region-specific hosting and a compliance posture that's easier to defend to a security team than "we're talking to api.anthropic.com from a laptop." Worth knowing: Google's own coding teams use Claude Code. Gemini is great for research (1M-token context), but Claude Code is the day-to-day tool inside Google. Hard to top that as a reference.

Local models (open-source models on SWE-bench): Lingxi v1.5 71.2%, OpenHands+Qwen3-Coder 69.6%, GLM-4.6 68.2%

If a regulator says "no cloud models," local models are now viable. GLM, Qwen3-Coder, Kimi K2 variants sit in the high 60s to low 70s on SWE-bench. That's roughly where the cloud frontier was a year ago. Tools like Claude Code Router let you point Claude Code at a self-hosted model. Qwen CLI is open source. The cloud frontier is still ahead, but local is no longer toy-grade.

Claude Code Security Reviewer — AI-powered security review GitHub Action

Security concerns — Code implementation: code review still needed, AI review works great when done by a security expert, people make mistakes too

The other half of security is the code itself. More AI-generated code means more code to review, faster than humans can review it. Generic "do a security review" prompts produce generic results. What actually works: pair the security engineer with the tool. They know the threat model; they can write the specific checks; the LLM applies them at scale. And remember — humans also write insecure code. The right comparison isn't AI vs. a perfect reviewer, it's AI-augmented review vs. the review you actually do today.

Coding is not fun anymore — now, it's a real problem

Coding isn't fun anymore

This one is the hardest. People got into software development for one of two reasons: they love to build, or they love to write code. Those are different motivations, and the AI shift hits them very differently.

You love to build — the fun is still there, untouched

If you're a builder, this is a golden age. JustHTML was a side project — one person, evenings, one week. I rewrote the kitty terminal's macOS UX to match Ghostty's in about three hours: "take this kitty code in C/Python, take this Ghostty code in Zig, port the UX patterns over." I built a personal Todoist clone in a Sunday. Open source becomes pliable — you grab a tool, you tailor it. If what you love is shipping things, the fun is still there. There's just more of it.

You love to write code — the challenge of 2026: shift to mission-critical projects

If what you love is the act of writing code, the problem is real. Probably 30% of developers fall into this bucket. The current best answer is to move them to mission-critical work where hand-written code still matters — billing, safety-critical systems, low-level core paths. It's a partial answer. There isn't a complete one yet, and the industry will have to figure it out. Be honest about it when you talk to your team.

How to share the adoption — and why mandating doesn't work

How to share AI-coding adoption

The most common mistake managers make: "We'll buy 100 Cursor licenses, everyone must use Cursor by Q2."

AMS Review article: Consequences of mandated usage of innovations in organizations (2020)

Mandated enterprise innovations face ~50% implementation failure rates due to employee opposition

There's a well-known body of research on this (an AMS Review paper from 2020 is a good entry point): roughly half of mandated enterprise innovations fail outright. Humans resist tools that arrive with a deadline and no story.

I know of a real example: a product company with about 100 developers bought Cursor for everyone. 30 were already using it, 40 picked it up, and the other 30 didn't have a path in and didn't move. Three months later, the company used performance reviews to fire those 30 in a single hour and replaced them with people who already used AI tools. Beyond the cruelty, this was strategically bad: the 70 who remained now know the company will fire you for not adopting fast enough. Trust collapses. "Humans can replace humans" is a much scarier message to your team than "AI is replacing humans," because the first one is plausible and the second one isn't.

The Champion Model — Phase 1: The Sandbox

The Champion Model — Phase 1 and Phase 2 (The Expansion)

The Champion Model — full three phases including holdouts

What works better is the Champion Model — three phases, not one mandate.

Phase 1, the Sandbox: a small Tiger Team of senior developers who already want to do this. Their job is to break the tool, find what doesn't work, and write down the guardrails. Phase 2, the Expansion: roll out to teams with similar stacks, with the Tiger Team as mentors. Phase 3, the Holdouts: only after value is clear, work with resistant teams individually, with training that addresses their specific objections — security, fun, control, whatever it is. By Phase 3 you're not selling AI coding; the people next to them already use it daily, and the social pressure does the rest.

Sabbatical: Zach Wills LinkedIn post — shut down engineering for a week, 70 engineers learning AI software development

One ambitious version of Phase 1 + Phase 2: the sabbatical. Zach Wills posted about shutting down their entire 70-person engineering department for a week. No tickets, no feature work. Everyone learning AI tools. By the end of the week, mobile-naive engineers were shipping mobile apps; one developer built a working CRM (backend, frontend, auth) before lunch. Most companies can't or won't pause for a week — but if you can, the learning compounds faster than you'd believe, because the day-to-day pressure is the real blocker. Nobody has time to learn a new tool when the backlog won't stop.

Case sharing — Slack channel community to share AI coding tricks across three phases

A lighter-weight version of the same idea: a cross-company Slack channel. We run one with friends from different companies. People drop in tricks, gotchas, prompt patterns, new tools. Yesterday someone there asked whether they could really get Claude Code to configure BGP networking on a Linux server. The answer was: yes, give it SSH access in a non-prod environment and describe what you want. He resisted for half an hour — "I want to learn it myself" — and a few hours later messaged back that he was now doing all his ops with Claude Code. He's a strong engineer. The channel did the work of convincing him; we just gave him the room.

AI-Coding Framework Implementation: cursorrules / CLAUDE.md, documentation, skills/MCPs, session history library

Practical implementation framework

Four pieces of plumbing make all of this easier:

A CLAUDE.md (or .cursorrules) per project. Encode your conventions: which language version, which package manager ("always use uv for Python"), which branch hygiene, which directories to ignore. The fewer constraints the model has to guess, the fewer hallucinations you get.
Documentation that the agent can read. Convert vendor PDFs to markdown, point the agent at them. I once gave Claude Code the Cilium eBPF book as markdown and asked it to use the relevant chapters as the design reference for my own eBPF agent. It works.
Skills and the right MCP. Superpowers for the workflow. Context7 as the MCP for live library docs — it pulls authoritative documentation for ~10,000 libraries on demand, instead of letting the model guess. I deliberately keep my MCP list short: context7 plus whatever I've pre-converted to markdown. The fewer pre-installed MCPs, the cleaner the context.
A library of session history. Senior people develop tricks. Capture sessions — record them, or just save the transcripts — and use them as teaching material. It's the AI-coding equivalent of pairing with a senior engineer.
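
As a concrete starting point for the first item, the sketch below renders a CLAUDE.md from a dictionary of conventions. CLAUDE.md is free-form markdown, so the section names, the Python version, and the directory names here are illustrative assumptions; only the uv rule comes from the talk.

```python
# Hypothetical project conventions; replace with your own.
CONVENTIONS = {
    "Language and tooling": [
        "Target Python 3.12",  # assumed version, for illustration only
        "Always use uv for Python",
    ],
    "Branch hygiene": [
        "Create one feature branch per task; never commit directly to main",
    ],
    "Directories to ignore": [
        "Do not read or edit vendor/ or build/",  # assumed directory names
    ],
}


def render_claude_md(project: str, conventions: dict) -> str:
    """Render conventions as a short markdown file an agent can read."""
    lines = [f"# {project}: conventions for AI coding agents", ""]
    for section, rules in conventions.items():
        lines.append(f"## {section}")
        lines.extend(f"- {rule}" for rule in rules)
        lines.append("")
    return "\n".join(lines)


claude_md = render_claude_md("demo-service", CONVENTIONS)
print(claude_md)
# To use it, write the text to CLAUDE.md in the project root, e.g.:
# pathlib.Path("CLAUDE.md").write_text(claude_md, encoding="utf-8")
```

Keeping the conventions in one structured place makes it easier to emit the same rules for more than one tool, for example a .cursorrules file alongside CLAUDE.md.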

Problems: 30% won't switch, pace of change is insane, PRs growing in size, prompt engineering takes time

What's still unsolved

Time for an honest list.

Roughly 30% of developers don't want to switch. Not because of trust, but because the thing they loved is being taken away. There's no clean answer yet.
The pace of change is brutal. Claude Code itself launched in February 2025, and Claude Code Skills landed in autumn 2025. Opus 4.5 resolves roughly 3-4x as many SWE-bench tasks as the leading model from 18 months earlier (74.4% vs. 21.6% for GPT-4o). You can't sit still: last year's playbook isn't this year's playbook.
PRs are getting bigger. When the agent ships more code, review queues grow. The reviewer becomes the bottleneck. We don't have a great answer to this yet beyond "use AI for first-pass review" — which is itself an open problem.
Prompt engineering is a real skill. Around 70% of people struggle to formulate what they need precisely. That's why brainstorming skills exist — let the agent ask you the clarifying questions before it writes anything. But the underlying ability — describing requirements crisply — is real management craft, and not every developer has it yet. It will need to be trained.

The future / The challenges: end-to-end features, parallel development, multi-service AI-built apps, junior engineer problem

What 2026 brings

End-to-end feature delivery. Whole features, including multi-service applications, built almost entirely by agents. DevOps people will have feelings about this.
Parallel development. Run Claude Code in YOLO mode inside a sandboxed VM with a well-scoped task, and it'll work for hours. I already do this across separate projects. Doing it inside one project (multiple tasks at once on the same codebase) is the next step. It feels insane the first time it works.
The junior engineer problem. If juniors don't write code, they don't build the mental model of why code behaves the way it does — and without that mental model, they can't supervise an agent effectively. The industry doesn't have an answer yet for how juniors learn in an AI-first world.

Q&A highlights

On refactoring large legacy codebases. Generic "do a code review" prompts are weak. Pointing at specific files or directories with explicit duplication concerns ("these three files have the same loading logic; consolidate") works much better. Indexing the whole repo with embeddings or RAG isn't something I rely on — I let the agent's own tooling search the codebase, and I scope it manually when it matters.

On English as a programming language. Yes, in a sense. Someone has already used Claude Code to design a new programming language end to end (with tests and an interpreter). That doesn't mean English replaces code — it means natural language is becoming the highest level of an existing stack.

On hiring when AI use is uneven. If your team adopts AI and one person doesn't, eventually they can't keep up. That's real. The answer isn't to fire them — it's to bring them in gradually, on their terms, with people they trust. Mass-firing 30% of your team to make a point is a guaranteed way to lose the other 70% to attrition.

The pace will keep accelerating. Pair the people who already do this with the ones who don't, give them air cover, write down what works, and re-evaluate every quarter. Anyone whose last try was a year ago is making decisions based on a previous era.