My local LLM clone failed where the data failed

I tried to make a local LLM that sounded like me, and the useful lesson was not which checkpoint won. The useful lesson was that my data was the wrong shape for the thing I wanted.

The project was still worth doing. I learned the local fine-tuning path, built a small private chat app, and got enough output to see why the idea is tempting. But if someone asked me where to start with a "clone yourself" local LLM project, I would not start with a model comparison. I would start with the data. If the examples do not represent the interaction you want the model to perform, another LoRA run mostly gives you a cleaner progress bar.

I have shelved this project for now. I had fun with it, but I do not want to spend the time collecting data that matches the target behavior at the moment. That is the other lesson: it is fine to learn what a project can teach you, accept the limit, and stop.

The export looked bigger than it was

The source was a Discord data export. Discord documents the account-data request flow, and the export I had locally was useful because it was private, familiar, and large enough to make the experiment feel possible.

The first pass parsed 80,102 messages. After cleaning, 75,724 messages remained. The May 2024 onward slice produced 7,835 grouped responses and an estimated 287,498 tokens in the completion dataset. The MLX split for that dataset was 7,051 training records, 392 validation records, and 392 test records.

On paper, that sounds like a decent personal style dataset.
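
The split figures add up (7,051 + 392 + 392 = 7,835) and are consistent with a roughly 90/5/5 partition of the grouped responses. As a minimal sketch of how a split like that could be produced, the snippet below shuffles records and writes them as JSONL. The function names, the seed, the fractions, and the {"text": ...} record shape are assumptions for illustration, not the project's actual code; the train.jsonl, valid.jsonl, and test.jsonl file names follow the layout MLX-LM's LoRA tooling reads from a data directory.

```python
import json
import random
from pathlib import Path


def split_records(records, valid_frac=0.05, test_frac=0.05, seed=0):
    """Shuffle a copy of the records and return (train, valid, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_frac)
    n_test = round(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


def write_jsonl(path, records):
    """Write one JSON object per line, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Placeholder records standing in for the 7,835 grouped responses.
    records = [{"text": f"placeholder {i}"} for i in range(7835)]
    train, valid, test = split_records(records)
    print(len(train), len(valid), len(test))  # 7051 392 392
```

Calling write_jsonl("data/train.jsonl", train), and the same for valid and test, would then produce a directory that can be passed to a local fine-tuning run. The "data" directory name is hypothetical.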

The problem was the shape. The official export I had appeared to contain only the messages I authored, not the preceding turns from other speakers that prompted them. That meant the dataset was mostly continuation material: fragments of me, grouped into bursts, without reliable conversational prompts.

That distinction matters if the goal is a local model that can reply like you. A pile of your messages can teach surface habits. It does not automatically teach when those habits belong.
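
The difference is easiest to see in the training records themselves. The sketch below is illustrative only: the Message type and to_record helper are hypothetical names, and the two record shapes are the plain {"text": ...} and {"prompt": ..., "completion": ...} JSONL formats that MLX-LM's LoRA tooling accepts. When the previous turn is missing, the only record that can be built is a continuation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    author: str
    text: str


def to_record(previous: Optional[Message], reply: Message) -> dict:
    """Build a training record for one of my messages.

    With no previous turn available, all we can produce is continuation
    text. With a previous turn, we can produce a prompt/completion pair
    that actually represents a reply.
    """
    if previous is None:
        return {"text": reply.text}
    return {"prompt": previous.text, "completion": reply.text}


if __name__ == "__main__":
    mine = Message("me", "yeah give me ten minutes")
    theirs = Message("friend", "are you joining tonight?")
    print(to_record(None, mine))    # what an authored-messages-only export allows
    print(to_record(theirs, mine))  # what a reply model needs
```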

I made stricter variants because the raw dataset was too blunt. The prefix/rest split kept local Discord cadence but removed obvious mechanical junk: huge pasted blocks, URLs, files, commands, attachments, likely secrets, and oversized groups. That reduced the useful strict split to 983 training records, 55 validation records, and 54 test records.

That was the trade-off in one number. The more I filtered toward examples I actually trusted, the less data there was.
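
A strict filter of this kind can be sketched in a few lines. Everything here is an assumption for illustration: the regular expressions, the size thresholds, and the function names are hypothetical, and a real secret detector would need to be far more careful than a "long opaque token" pattern.

```python
import re

# Hypothetical patterns and limits; tune them against your own export.
URL_RE = re.compile(r"https?://\S+")
COMMAND_RE = re.compile(r"^\s*[!/.$][A-Za-z]")          # bot or shell style commands
SECRET_RE = re.compile(r"\b[A-Za-z0-9_\-]{32,}\b")      # long opaque tokens
MAX_CHARS = 400
MAX_LINES = 6


def rejection_reason(text):
    """Return a short reason string if the text should be dropped, else None."""
    if URL_RE.search(text):
        return "url"
    if COMMAND_RE.match(text):
        return "command"
    if SECRET_RE.search(text):
        return "likely_secret"
    if len(text) > MAX_CHARS or text.count("\n") + 1 > MAX_LINES:
        return "oversized"
    return None


def strict_filter(texts):
    """Split texts into kept examples and a count of drops per reason."""
    kept, dropped = [], {}
    for text in texts:
        reason = rejection_reason(text)
        if reason is None:
            kept.append(text)
        else:
            dropped[reason] = dropped.get(reason, 0) + 1
    return kept, dropped
```

Keeping the per-reason counts matters more than the filter itself: they show which rule is eating the dataset, which is how a number like 983 training records becomes explainable rather than surprising.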

Fine-tuning was not the bottleneck

The training setup was not the awkward part. MLX-LM supports text generation and fine-tuning on Apple Silicon, and its LoRA documentation covers local JSONL datasets, validation files, adapters, and generation from adapters. That was enough to run the experiment locally without uploading private Discord data, prompts, reports, or adapters.

I tried multiple model branches. Qwen2.5 3B trained, but the qualitative review was not useful. Qwen2.5 7B failed the base-generation sanity gate before training. Qwen3 4B looked more promising before training, then brought assistant/template behavior back after fine-tuning.

The least-bad branch was mlx-community/Mistral-7B-v0.3-4bit, using the local adapter at:

models/adapters/may_2024_onwards_prefix/mistral_7b_v0_3_4bit_text_lora/0000400_adapters.safetensors

I am deliberately saying "least-bad" rather than "best". The model path mattered, but the model was not the main lesson. I could change base models, checkpoint numbers, sampling settings, and dataset variants, but the outputs kept circling the same problem: the model was being asked to infer an interaction from data that mostly did not contain the interaction.

That is where validation loss becomes a weak comfort. A run can look more successful in training logs and still fail the actual taste test. For this project, the taste test was direct: open the local chat app, talk to it, and see whether the replies felt like something I would plausibly send.

Most of the time, I could tell quickly that they did not.

The chat app became the real eval

The useful thing I built was not a perfect clone. It was a local app that made the failure easier to inspect.

The current runtime path uses Mistral plus the 0000400 LoRA adapter checkpoint, then wraps it in local retrieval and routing. Retrieval is lexical, BM25-style, and local. Candidate generation is local. Scoring and reranking are local. There are no hosted generation calls, external embeddings, external judges, or uploaded private examples in the loop.

That wrapper retrieves short examples from the local prefix training split, samples multiple candidate continuations, rejects obvious bad outputs, and picks the least-bad reply using a local Billie-shape heuristic.
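
For readers who have not written lexical retrieval by hand, here is a minimal sketch of BM25-style scoring over a list of short examples. It is a generic Okapi BM25 implementation with common default parameters (k1 = 1.5, b = 0.75), not the project's retrieval code; the class name, tokenizer, and sample texts are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avg_len = sum(len(d) for d in self.docs) / max(len(self.docs), 1)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: in how many docs each term appears.
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, length = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        """Indices of the k best-scoring documents, skipping zero scores."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [i for s, i in scored[:k] if s > 0]


if __name__ == "__main__":
    examples = [
        "i think the build is broken again",
        "lol yeah that game was great",
        "pizza tonight?",
    ]
    index = BM25(examples)
    print(index.top_k("is the build broken", k=2))  # [0]
```

The retrieved examples would then be placed in the prompt ahead of generation. Lexical retrieval is a reasonable fit here because it runs locally with no embedding model, at the cost of missing paraphrases.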

The held-out routing report from 28 May 2026 made the improvement visible:

54 held-out samples
direct adapter average Billie-shape score: 15.6889
retrieval-routing average Billie-shape score: 37.287
routed output beat direct output on 48 of 54 samples
direct adapter hard rejects: 21 of 54
selected routed hard rejects: 0 of 54

That is useful evidence, but it is not proof of a solved clone. The Billie-shape score is a local reranking heuristic. It can reward surface texture. It can punish ordinary real completions. It is a tool for choosing between local candidates, not a benchmark for whether the model is meaningfully me.
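
The headline figures in that report reduce to a few simple rates, worked out below from the reported numbers only. The variable names are mine; nothing here is new data.

```python
# Figures copied from the held-out routing report.
held_out = 54
routed_wins = 48
direct_hard_rejects = 21
routed_hard_rejects = 0
direct_avg = 15.6889
routed_avg = 37.287

print(f"routed beat direct: {routed_wins / held_out:.1%}")               # 88.9%
print(f"direct hard-reject rate: {direct_hard_rejects / held_out:.1%}")  # 38.9%
print(f"routed hard-reject rate: {routed_hard_rejects / held_out:.1%}")  # 0.0%
print(f"average score gap: {routed_avg - direct_avg:.4f}")               # 21.5981
```

The score gap is measured in units of the same heuristic that picked the routed reply, so it says the picker works as a picker. It does not say how far either output is from a real message.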

The stronger evidence came from using the app. If a reply felt wrong, it felt wrong immediately. That kind of evaluation is subjective, but for a personal clone project it is also unavoidable. You are not only measuring whether the model can produce fluent text. You are measuring whether you can stand talking to it.

Routing helped, but it did not fix the data

The chat picker audit shows the value and the limit of the routing layer.

In one local audit, five prompts produced 70 candidates. Candidate-level raw speaker loops appeared 35 times. Hidden bad raw text appeared 37 times. After rejection and ranking, the selected replies had 0 raw speaker-loop winners, 0 hidden-bad-raw winners, 0 context-copy-risk winners, and 0 duplicate selected replies.

That is exactly the kind of thing routing is good at. It can stop the worst candidates from reaching the chat surface. It can compare direct candidates against retrieved-context candidates. It can make copy risk, loops, terse answers, and weird raw continuations visible.

It cannot make weak data into strong data.

That became the practical tension. I could keep training, or I could accept that the local app had shown me the real bottleneck. More training was tempting because it looked like progress. Better data was the thing that would actually change the ceiling.

If I came back to this, I would not start by running another checkpoint. I would start by collecting or constructing data that matches the target behavior:

actual prompt and reply pairs, not just my side of the conversation
enough context to know why a reply was sent
examples filtered for the kind of reply I want the model to learn
a local preference set from blind candidate comparisons
an app-level eval where I can talk to it and mark what feels wrong

Only then would another training run be more than tinkering.
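
As a rough sketch of what the preference-set item on that list could look like in practice, the snippet below shows two candidates in random order, records which one was picked, and stores the result with its context. The record fields and function names are assumptions for illustration; no such dataset exists in this project yet.

```python
import json
import random
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class PreferenceRecord:
    context: List[str]   # earlier turns that explain why a reply was sent
    prompt: str          # the message being replied to
    chosen: str          # the candidate I preferred
    rejected: str        # the candidate I passed over


def blind_pair(candidate_a, candidate_b, rng):
    """Return the two candidates in random order with their hidden labels."""
    shown = [("a", candidate_a), ("b", candidate_b)]
    rng.shuffle(shown)
    return shown


def record_choice(context, prompt, shown, picked_index):
    """Turn a pick (0 or 1, by display position) into a preference record."""
    chosen = shown[picked_index][1]
    rejected = shown[1 - picked_index][1]
    return PreferenceRecord(context, prompt, chosen, rejected)


if __name__ == "__main__":
    rng = random.Random()
    shown = blind_pair("sure, give me ten", "I would be happy to help!", rng)
    for position, (_, text) in enumerate(shown):
        print(position, text)  # labels stay hidden from the reviewer
    record = record_choice(["hey"], "you around?", shown, picked_index=0)
    print(json.dumps(asdict(record)))
```

Hiding which source produced each candidate is the point: the pick then reflects how the reply reads, not which model or checkpoint I was hoping would win.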

The rule I would use next time

If you want to build a local LLM clone of yourself, the fun path is to fine-tune something quickly and see what happens. That is a good way to learn the tools.

The serious path starts earlier. Decide what the model is meant to imitate. Is it autocomplete for your writing? A chat reply generator? A tone transfer tool? A private toy that only needs to make you laugh? Those are different datasets.

My mistake was treating "lots of my messages" as close enough to "examples of how I respond". It was close enough to make the project interesting. It was not close enough to make it good.

So the stopping rule is this: when the app is mostly teaching you that the data is wrong, stop optimizing around the data. Either collect examples that match the target behavior or put the project down.

For now, I am putting it down. The useful output is not a clone. It is a clearer rule for the next attempt: data first, routing second, training third, and stop when the experiment has already answered the question.