Vibe Coding Clinical Tools With AI: What Worked, What Failed, and What I’d Do Differently

By John Britton

I’ve always been tech-forward, but I never took the time to fully learn a programming language. So when AI started getting meaningfully better at writing code, I jumped at the chance to use it as a shortcut: not to become a developer, but to finally build things I’d been thinking about. That approach, AI-assisted programming, is often called “vibe coding” (writing software without formal coding experience). This is what I learned building LocalScribe, a local AI documentation tool for clinicians.

Along the way, I learned how to work with these tools in a way that made them genuinely useful rather than frustrating.

Why AI Seduction Is Dangerous

When I started spending serious time vibe coding in mid-2025, I didn’t realize how sycophantic most coding models are. Outside a basic chat interface, in tools like Cursor or Windsurf, it’s easy to get swept up by confident explanations, clean summaries of what was “accomplished,” and a wall of green checkmarks. Then you actually run the code and discover it either doesn’t work at all or behaves nothing like what was described.

As a non-programmer, learning how to trust models turned out to be very different from learning how to use them. I had to slow way down. If I couldn’t demonstrate something through actual behavior, a visible UI change, or a simple test I understood, I stopped believing the explanation, no matter how confident it sounded.
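To make that concrete: the checks I relied on were rarely fancier than the sketch below, written here with pytest and a hypothetical `deidentify` function (the names and import path are illustrative, not the actual LocalScribe code). A claim like “the redaction works now” either passes a test like this or it doesn’t, no matter how confident the explanation sounds.

```python
# test_deidentify.py: the kind of simple, behavior-level check I mean.
# `deidentify` and its import path are placeholders, not the real project code.
from localscribe.deid import deidentify


def test_phone_number_is_redacted():
    note = "Patient called from 555-867-5309 to reschedule."
    assert "555-867-5309" not in deidentify(note)


def test_clinical_content_is_preserved():
    note = "Patient reports improved sleep since last session."
    assert "improved sleep" in deidentify(note)
```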

Open Source Foundations

That lesson hit hardest when I started working on a project to de-identify clinical notes and reports. I didn’t anticipate how much bloat AI would introduce by generating new regex patterns over and over, without really considering what was already in the codebase. I started with open-source projects, which helped, but I quickly learned that many of them are designed as flexible foundations, not drop-in solutions.

They require tuning, domain-specific constraints, and careful extension. One skill I learned the hard way, more than once, was recognizing the boundary between what I actually understood and what the AI was capable of generating. Just as important was realizing that models won’t reliably tell you when a mature solution already exists. Even when you ask directly, due diligence still matters. I lost many hours trying to brute-force problems that were either already solved elsewhere or fundamentally unsolvable in the way I was framing them, like trying to perfectly balance sensitivity and false positives across every possible type of PHI identifier.
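To show what that bloat looks like in practice, here is a minimal sketch of the kind of centralized pattern registry I eventually moved toward. It is illustrative only, not the actual LocalScribe code: the point is that when every PHI pattern lives in one named table, a fourth, slightly different phone-number regex is easy to spot, and the sensitivity-versus-false-positive trade-off becomes something you can inspect rather than something scattered across the codebase.

```python
import re

# One named registry of PHI patterns. When every pattern lives here, it is
# obvious when the AI is about to add a near-duplicate of an existing regex.
PHI_PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def deidentify(text: str) -> str:
    """Replace each match with a typed placeholder like [PHONE]."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


# The trade-off described above: widen these patterns and they start redacting
# clinical content; tighten them and identifiers slip through.
print(deidentify("Seen on 3/14/2024, callback 555 867 5309, MRN: 0042137."))
```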

Complexity Without Understanding

The PHI de-identification project eventually came to a halt when a new “fast” model was released and bundled into the Cursor plan I was already using. At first, it felt like a win. The model was responsive, confident, and constantly making small changes. In retrospect, that should have been the warning sign.

I noticed fairly early that it was probably stuck in a destructive loop, but I waited too long to intervene. Because I wasn’t familiar with GitHub workflows, pull requests, or even basic version recovery, the model ended up quietly dismantling my codebase piece by piece, and I didn’t really know how to put it back together.

This wasn’t the first time something like that had happened. Earlier versions of the project had broken in smaller, more recoverable ways. Usually I could debug my way out by pasting error messages back into the model and iterating until things worked again. That pattern gave me a false sense of safety. I started to believe that patience alone would fix things, or that having multiple AI agents working on the same code at the same time was somehow efficient. What I was really doing was increasing complexity without increasing understanding.

Fine-Tuning Models

One part of this project I haven’t mentioned yet is that I did go down the fine-tuning path. I spent time in Google Colab, built small datasets, and benchmarked outputs along the way. I tracked hallucinations, safety issues, cultural sensitivity, formatting errors, and plain clinical accuracy. It showed me how finicky fine-tuning LoRA adapters can be with small models, particularly when you’re trying to improve behavior that the model already handles fairly well.
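For anyone who wants a picture of what that setup involved, here is a minimal sketch of a LoRA configuration, assuming the Hugging Face transformers and peft libraries; the model name and hyperparameters are placeholders rather than the exact ones I used. Even at this size there are several knobs (rank, alpha, which modules to target), and that is before the dataset and training loop, which is where most of the finickiness lived.

```python
# Minimal LoRA setup sketch (assumes transformers + peft; values are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small adapter matrices instead of updating the full model.
lora = LoraConfig(
    r=8,                                  # adapter rank: the main capacity knob
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```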

What surprised me was that, at least at this stage, the biggest gains didn’t come from more fine-tuning. They came from learning how to prompt better inside a system that was already built.

By that point, the app had structure. The system prompt was doing real work. I understood the failure modes. Instead of retraining anything, I focused on crafting a tighter prompt for the local models I was already using, applying everything I’d learned about constraints, boundaries, and verification.

That’s when things finally clicked. The PHI de-identification project I had previously spent more than fifty hours stuck on was working in one to two hours. Not because the model suddenly got smarter, and not because fine-tuning finally paid off, but because the surrounding context was finally right. With the structure already in place, better prompt craft was enough to get the de-identification feature working.
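I won’t reproduce the full prompt here, but its shape looked something like the sketch below, assuming a local model served through Ollama (the runtime, model name, and wording are all illustrative, not the actual LocalScribe prompt). The rules map onto what I mean by constraints, boundaries, and verification: what counts as an identifier, what the model is not allowed to change, and an output that is easy to check against the input.

```python
# Illustrative sketch of a constrained de-identification prompt for a local model.
# Assumes the `ollama` Python client and a locally pulled model; names are placeholders.
import ollama

SYSTEM_PROMPT = """You de-identify clinical text.
Rules:
1. Replace names, dates, phone numbers, addresses, and record numbers with
   typed placeholders such as [NAME], [DATE], [PHONE].
2. Do not add, remove, summarize, or rephrase any clinical content.
3. If you are unsure whether something is an identifier, redact it.
4. Return only the de-identified text, nothing else."""


def deidentify(note: str) -> str:
    response = ollama.chat(
        model="llama3.1:8b",  # placeholder local model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note},
        ],
    )
    return response["message"]["content"]
```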

Planning with AI

That shift also forced me to rethink how I planned work with AI. Early on, I relied on large, elaborate plans that looked impressive but were hard to execute. I assumed that pushing models to their limits during planning would save time later. In practice, it usually did the opposite.

What worked better was a narrower plan with a solid framework and fewer moving parts. I also experimented with using stronger models to plan and cheaper or faster models to implement inside the IDE. That worked only to a point. The handoff often failed when the implementation required nuanced decisions the plan couldn’t fully capture.

Over just a few months, this changed dramatically as models improved. Prioritizing the strongest available models for the hardest parts of the work now makes an outsized difference. Projects that used to take ten or more hours of unfocused vibe coding can now be completed in one to three hours, with less rework and better results. Keeping up with model capabilities, and being honest about which ones are good for what, became part of the workflow itself.

Clinical Expertise

With some tenacity, a willingness to learn a few new terms, and a genuine curiosity about how things work, building with AI becomes surprisingly accessible to non-coders, even in psychology. What clinicians bring to this process is not technical polish, but domain expertise that models simply do not have.

A clinician knows what high-quality documentation actually looks like, when a diagnosis is appropriate, which safety checks matter, and where ambiguity needs to be surfaced rather than smoothed over. That knowledge shapes these systems in ways a general-purpose developer would likely miss. When clinicians build tools, even imperfect ones, that expertise shows up everywhere.

Conclusion

This approach is for clinicians who have felt friction in their workflow and wondered whether something better was possible, but assumed building was out of reach. It’s also for trainees and students who want to understand AI from the inside rather than treating it as a black box. It will resonate most with people who have already tried once and failed. Those false starts are rarely evidence that building isn’t for you. More often, they’re signs that the constraints weren’t right yet.

This isn’t for people looking for shortcuts or zero-effort automation. AI doesn’t replace clinical judgment, and it doesn’t eliminate the need to think carefully about structure, accuracy, and ethics. If the goal is to press a button and let the model handle everything, this approach will feel slower, not faster.

Building with AI, at least in clinical contexts, isn’t really about technology. It’s a way of thinking. Clinicians already have the domain knowledge these systems lack. When that expertise is paired with careful constraints and a willingness to experiment, the result isn’t magic. It’s something quieter and more useful: tools that fit the work instead of fighting it.

If you’re a clinician who’s felt stuck between needing better tools and not knowing how to build them, this approach might be for you. LocalScribe is what came out of that process.