FreshAir Labs

Building things, one experiment at a time.

What Happens When You Run an LLM on 10 Years of Kaiser Records

I downloaded my complete medical history from Kaiser — a decade of encounters, lab results, medications, vitals, and clinical notes — and ran it through a pipeline of medical language models to see what they could actually find.

The privacy problem

The obvious approach is to upload everything to Claude or GPT-4 and ask questions. But these are medical records. Names, dates, conditions, medications — the most sensitive data you have. Even with zero-retention API policies, your data still touches their servers, and de-identified records can be re-identified through account metadata and IP addresses.

So I built a three-stage pipeline:

Stage 1 (Private extraction): Google's MedGemma 27B and DeepSeek-R1, running on rented GPUs through RunPod. The models run on hardware I control, process the records, extract structured findings, and then the GPU instance is destroyed. Total cost for extraction: under $4.

Stage 2 (Manual enrichment): I add context no medical record contains — family history, lifestyle factors, personal health goals, symptoms I never mentioned to a doctor.

Stage 3 (Deep analysis): Multiple analysis passes through frontier models, each with a different lens: cardiovascular risk scoring, screening for early cognitive decline, building a checklist for the next PCP visit, or a contrarian pass that challenges the consensus of the other analyses.
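The hand-off between stages is just structured data. A minimal sketch of one plausible shape (the field names and values below are illustrative, not my actual findings or a prescribed format):

```python
import json

# Illustrative only: hypothetical field names and values, one plausible
# shape for the data passed between pipeline stages.
stage1_findings = {            # extracted on the private GPU (Stage 1)
    "medications": ["amlodipine 5 mg daily"],
    "lab_trends": ["LDL trending upward, 2018-2023"],
}
stage2_enrichment = {          # added by hand (Stage 2); never in the record
    "family_history": ["father: early-onset coronary artery disease"],
    "lifestyle": ["endurance running, ~30 mi/week"],
}

# Stage 3 feeds the merged context to each frontier-model "lens"
context = json.dumps({**stage1_findings, **stage2_enrichment}, indent=2)
print(context)
```

The point of the merge is that only the merged document ever reaches a frontier model, and by then it contains findings rather than raw records.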

What the records actually look like

Kaiser exports in C-CDA format — XML files following the HL7 Clinical Document Architecture standard. They use medical coding systems throughout: SNOMED CT and ICD-10 for diagnoses, LOINC for labs, RxNorm for medications. The structured data (lab values, medication lists, immunization dates) parses cleanly with Python's lxml — no LLM needed. Where the LLM adds value is the unstructured clinical notes and cross-referencing patterns across years of visits.
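As a sketch of the structured parsing, here's a minimal lxml pass that pulls lab observations out of a C-CDA-style fragment. The snippet is hand-written for illustration (a real Kaiser export is a full ClinicalDocument with nested sections and entries), but the namespace and element names follow the HL7 CDA schema:

```python
from lxml import etree

# Hand-written C-CDA-style fragment for illustration; a real export
# is a full ClinicalDocument with nested sections and entries.
CCDA_SNIPPET = b"""<?xml version="1.0"?>
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <observation>
    <code code="2093-3" codeSystem="2.16.840.1.113883.6.1"
          displayName="Cholesterol [Mass/volume] in Serum or Plasma"/>
    <value value="187" unit="mg/dL"/>
    <effectiveTime value="20190412"/>
  </observation>
</ClinicalDocument>"""

NS = {"hl7": "urn:hl7-org:v3"}

def extract_labs(xml_bytes):
    """Collect LOINC code, name, value, unit, and date from observations."""
    root = etree.fromstring(xml_bytes)
    labs = []
    for obs in root.findall(".//hl7:observation", NS):
        code = obs.find("hl7:code", NS)
        value = obs.find("hl7:value", NS)
        when = obs.find("hl7:effectiveTime", NS)
        if code is None or value is None:
            continue
        labs.append({
            "loinc": code.get("code"),
            "name": code.get("displayName"),
            "value": value.get("value"),
            "unit": value.get("unit"),
            "date": when.get("value") if when is not None else None,
        })
    return labs

print(extract_labs(CCDA_SNIPPET))
```

The same traversal over a full export yields a flat table of every lab ever drawn, which is the input the extraction models never need to see raw XML for.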

What worked

The private extraction pipeline captured about 80-85% of what a frontier model would find. The biggest gap was in creative, non-obvious connections — the kind of "have you considered that these three unrelated symptoms might be connected?" insights where frontier models are noticeably stronger.

MedGemma was the standout. Most "medical" LLMs don't actually beat general-purpose frontier models on clinical tasks. MedGemma is the exception — it was trained on real FHIR EHR data and the multimodal variant can interpret imaging alongside text.

What didn't work

GPU sizing was a painful lesson. I started with instances that were too small for the models I was trying to run, and my initial debugging made it worse: I kept assuming the failures were a configuration or code problem, not a hardware problem. So I'd retry, wait for the pod to spin up, watch it fail again, and burn through both money and time before realizing I just needed a bigger instance.
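A back-of-envelope check would have caught this before renting anything. The rule of thumb below (about 2 bytes per parameter at fp16, padded for KV cache and activations) is an approximation I'd apply up front, not a vendor spec:

```python
def min_vram_gb(params_billions, bytes_per_param=2, overhead=1.2):
    """Rough VRAM floor for inference: weights * dtype size (fp16/bf16),
    padded ~20% for KV cache and activations. Approximate, not exact."""
    return params_billions * bytes_per_param * overhead

# MedGemma 27B at fp16 lands around 65 GB -- too big for a single 48 GB card
print(round(min_vram_gb(27), 1))
```

Running that one function before picking a pod would have told me the first several instances I tried were never going to work.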

And the RunPod workflow itself has friction. Every time you want to run an analysis, you're logging into a web dashboard, manually configuring a pod, waiting for it to provision, SSHing in, running your scripts, and tearing it down. There's no good API-first workflow for "spin up this exact configuration, run this job, give me the results, destroy everything." It's a lot of clicking for what should be a single command.

What I'd do differently

The multi-pass analysis framework generates redundant findings. Three passes would have been enough. And I'd spend more time on the manual enrichment stage — the LLM can only work with what it has, and the medical record is missing half your health story.

The takeaway

You can run meaningful medical AI analysis on your own records for under $4, keeping the sensitive data on hardware you control. The results aren't a replacement for a doctor — but they're a remarkably good preparation for your next appointment.

llm healthcare privacy runpod