April 29, 2026
A founder's note on where MAIA's medical coding agent is, where it isn't, and the data wall I just hit.
I've been building MAIA and searching for product-market fit (PMF) by myself for almost five months. It's been arduous. I learn a lot and get humbled all the time. I plan to release "white papers" or "blog updates" every couple of months to share updates and personal insights with anyone who's interested. It doesn't really matter to me whether that's investors, colleagues, friends, or family. These are my thoughts and insights on this journey. I'll try to keep it focused on MAIA, but when you make a startup your everything, it will get personal. I hope you guys enjoy the journey as much as I do (sometimes).
The three biggest things I've been working on: PMF through potential customer interviews, building the agentic AI infrastructure, and getting each agent's specific functionality reliable.
The PMF work has mostly happened through cold calls, cold emails, and just being a yapper. The yapping is fun. It needs to be focused, but I've met and connected with so many cool people. Even if this thing doesn't work out, I've built a great network and learned about a lot of cool problems. They were hyperspecific, but I think that's the nature of trying to sell to small businesses (that's a whole other conversation).
The agentic infrastructure has posed some issues. There are a variety of electronic health record (EHR) systems, and the agent needs to be ready to handle any of them. Some knowledge can be shared between them, but overall each one is very individualistic. These are real moats that I can cross with time, but oh boy!
The biggest go-to-market (GTM) opportunity for MAIA would be an efficient medical coding agent. A lot of these physicians outsource their coding and billing services. I've even been told some physicians set up their own companies overseas in order to have medical coding done more cheaply.
Pulling this off has overall been difficult. There has been some progress, but not as much as I had hoped.
I measure my results on a public test set that academic research uses for this exact problem. The headline number is "first try accuracy": when the system suggests one best code for a case, how often is it exactly right? We are now at about 46 out of every 100. We started at 36, which is roughly where the major general-purpose AI assistants land when you ask them to do this cold. Across six iteration cycles we have moved that 36 up by about ten absolute points, which is a 27 percent relative gain.
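For anyone who wants the arithmetic spelled out, here is a minimal sketch of how a "first try accuracy" number and the absolute versus relative gain can be computed. The function names and the toy codes are made up for illustration and are not MAIA's evaluation code or the benchmark's data. Plugging in the rounded 36 and 46 gives a relative gain of about 0.28; the exact figure depends on the unrounded scores.

```python
def first_try_accuracy(predicted, gold):
    """Share of cases where the single suggested code exactly matches the gold code."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one case")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def gains(before, after):
    """Return (absolute gain in points, relative gain as a fraction of the start)."""
    absolute = after - before
    return absolute, absolute / before


# Toy data only (hypothetical, ICD-10-style codes): 2 of 4 suggestions match exactly.
predicted = ["I10", "E11.9", "I48.91", "J45.909"]
gold = ["I10", "E11.9", "I48.0", "J45.901"]
toy_accuracy = first_try_accuracy(predicted, gold)

# Rounded headline figures from this post: 36 at the start, about 46 now.
absolute_gain, relative_gain = gains(36.0, 46.0)
print(f"toy accuracy: {toy_accuracy:.2f}")
print(f"absolute gain: {absolute_gain:.0f} points, relative gain: {relative_gain:.1%}")
```

Exact match is a strict yardstick: a suggestion that is clinically close but differs in any character counts as a miss.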
For context: the biggest, most-quoted general-purpose AI models score somewhere between the high 20s and mid 40s on this same task. The published academic baseline that researchers in this space cite sits at about 40. We are now slightly above it. The current research state of the art on the same dataset is around 60, but it leans on a couple of techniques we have not implemented yet, and that gap is well understood.
One specialty in particular, cardiology, lands at about 80 out of every 100 on first try accuracy.
I see no more gains from architectural improvements for the medical coding agent. I have swallowed the lake that is open-source data to train on. My evaluation set is totally separate, so there's no data leakage. My only real gains from here will come from real data and deployment. This is where I will struggle. Private medical practices do not have the capacity for mistakes from agents. This will only be solved through connections. Through a technical lens, it's not that the architecture or the system is bad; it simply hasn't seen certain kinds of cases enough times to learn them. No change to the existing pipeline fixes those. Only seeing more of those cases does.
The biggest thing I can do to fix the agent is to pair it with real coders. Their corrections become the next round of training data. Up until now, I have been training on what I could get my hands on. From here, I need to train on what real coders disagree with MAIA about. There is no clever workaround. Every research group that has gotten past this number has done so by getting closer to actual coders.
I'm not defeated by this.
First, the plateau is exactly where small teams get stuck on this kind of problem, and the path off it is pretty well known in other domains. The wrinkle here is that an agent failure in healthcare lands a lot harder on the end user than it does in most software, which makes the usual "ship the worst version and iterate" advice harder to follow. The underlying recipe hasn't really changed though. I still need this in front of real users, even if the pilot has to be tightly chaperoned.
Second, the system is already wired for the kind of "continual" learning this needs. It produces flagged suggestions with confidence scores and surfaces what it isn't sure about. Bolting on a correction loop now mostly comes down to signing a willing partner.
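To make "flagged suggestions with confidence scores" a bit more concrete, here is a minimal sketch of the idea, not MAIA's actual code. The `Suggestion` fields, the 0.80 threshold, and the sample values are all assumptions picked for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    case_id: str
    code: str
    confidence: float  # assumed to be a score between 0.0 and 1.0


# Hypothetical cutoff; a real one would be tuned against reviewed cases.
REVIEW_THRESHOLD = 0.80


def is_flagged(suggestion, threshold=REVIEW_THRESHOLD):
    """A suggestion is flagged when its confidence falls below the threshold."""
    return suggestion.confidence < threshold


def review_queue(suggestions):
    """Order suggestions so the least confident ones reach a coder first."""
    return sorted(suggestions, key=lambda s: s.confidence)


# Made-up sample suggestions.
suggestions = [
    Suggestion("case-1", "I10", 0.95),
    Suggestion("case-2", "I48.91", 0.55),
    Suggestion("case-3", "E11.9", 0.72),
]

queue = review_queue(suggestions)
flagged_ids = [s.case_id for s in queue if is_flagged(s)]
print(flagged_ids)
```

Sorting rather than filtering keeps every suggestion in the queue, so a reviewer can still look at all of them while seeing the shakiest ones first.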
Third, cardiology-specific notes are hitting ~80%. This tells me the architecture is in the right neighborhood. If I narrow the scope to a specialty where I already have enough of the right kind of data, the agent already works at a usable level. This could be my starting point for the next partnership.
I'm looking for a tightly bounded pilot with a single specialty practice, ideally cardiology. The goal is straightforward: every code the system suggests gets reviewed by a real coder, every disagreement gets captured, and that disagreement becomes the training signal for the next version. I'll report back in a few months either way, partnership or not.
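The loop described above could be sketched roughly like this. It is an illustration under assumptions, not a description of how the pilot will be built: the `Review` record, its field names, and the placeholder note text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    case_id: str
    note_text: str       # the clinical note the code was suggested for
    suggested_code: str  # what the agent proposed
    coder_code: str      # what the human coder settled on


def is_disagreement(review):
    """True when the coder's final code differs from the agent's suggestion."""
    return review.suggested_code != review.coder_code


def corrections(reviews):
    """Turn each disagreement into a (note, corrected code) training pair."""
    return [(r.note_text, r.coder_code) for r in reviews if is_disagreement(r)]


# Placeholder records; the note text and codes are made up.
reviews = [
    Review("case-1", "(note for case-1)", "I10", "I10"),
    Review("case-2", "(note for case-2)", "I48.91", "I48.0"),
    Review("case-3", "(note for case-3)", "E11.9", "E11.9"),
]

training_pairs = corrections(reviews)
agreement_rate = sum(not is_disagreement(r) for r in reviews) / len(reviews)
print(training_pairs, round(agreement_rate, 2))
```

Keeping the agreeing reviews alongside the corrections also gives a running agreement rate, which is one simple way to see whether the next version is actually improving.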
If you run a coding team or a specialty practice, or you know someone who does and would be open to a careful pilot of this kind, I'd love to hear from you. I'm a one person team, but I will always make time for customers and partners.
Thanks for listening to my yap. It's back to building and finding PMF. The next writeup comes sooner if I have wins to share and later if I'm grinding through more failures and lessons.
Email me directly at maia@maiamed.ai. I'm scoping a tightly bounded medical coding partnership and would love to talk.