Fine-Tuning Field Notes #1: Training Worked. Inference Didn’t.

A QLoRA experiment showed that adapting model weights is only half the job. The rest is making the model reliably express what it learned.

Jun 14, 2026

Fine-Tuning Field Notes is a series on practical LLM fine-tuning: datasets, adapters, inference, evals, and the failure loops that make the concepts real. Here is the first article of many.

The training run looked clean.

Qwen loaded. The dataset formatted. The LoRA adapters attached. Training loss dropped. Validation loss dropped. The adapter saved.

Then I asked the model to classify a request and it returned:

assistant
assistant
assistant
assistant
assistant

That was something I didn’t expect and a useful failure.

The model had learned something. The system around it was still wrong.

The Use Case

Before getting into the training run, the use case matters.

This was not an attempt to teach the model new facts. I was not trying to load company documentation into the weights or turn a small model into a general-purpose enterprise assistant.

The task was narrower and more practical: classify internal AI requests into a governed operating model.

In an enterprise AI program, not every request should follow the same path. A simple Glean agent that summarizes internal documents is very different from a Databricks app that reads customer data and writes updates back to Salesforce. One is closer to no-code enablement. The other needs engineering review, governed data access, and production controls.

That made this a useful fine-tuning experiment.

The base model already understood the words in the request. The question was whether it could learn the specific routing policy:

which requests belong in which tier
which platforms should be recommended
when review is required
how to return the answer in a strict JSON schema

The goal was not to make the model smarter in general. The goal was to make it more consistent for one narrow decision.

The job

The goal was intentionally narrow.

I wanted to fine-tune an open model to classify internal AI requests into a governed routing schema designed for trusted innovation amongst stakeholder build requests.

The input was a plain-English request. The output needed to be structured JSON.

Example input:

A sales analyst wants to query curated opportunity data with SQL and build a dashboard for regional pipeline trends.

Expected output:

{
  "tier": "Trusted Technical Analyst",
  "risk": "medium",
  "recommended_platform": "Databricks + governed dashboarding",
  "needs_review": true,
  "reason": "The request requires SQL and governed dashboarding over business data."
}

The base model was already capable. It understood SQL. It understood dashboards. It understood Salesforce, ServiceNow, internal tools, and business workflows.

The gap was more specific.

I wanted the model to learn a particular operating model:

Glean-only requests should map to Trusted Functional Builder.
SQL, dashboards, and governed data should map to Trusted Technical Analyst.
Custom apps, integrations, autonomous workflows, and production writeback should map to Trusted Technical Builder.
Platform names should use controlled values.
needs_review should follow the governance policy.
The output should be valid JSON.

The baseline model got close, but it was not consistent enough.

For the SQL/dashboard example, the base model returned something like:

{
  "recommended_platform": "Databricks + Tableau",
  "needs_review": false
}

That is a reasonable answer in a generic enterprise context. It was not the answer I wanted for this operating model.

The desired response was:

{
  "recommended_platform": "Databricks + governed dashboarding",
  "needs_review": true
}

That distinction matters.

Fine-tuning here was not about teaching the model what SQL is. It was about teaching the model a specific vocabulary, schema, and review policy.

The setup

The experiment used:

Base model: Qwen3 4B Instruct
Method: QLoRA
Dataset: 32 examples
Split: 24 train / 4 validation / 4 test
Trainable parameters: 33,030,144 of 4,055,498,240
Percentage trained: 0.81%

That last number is the whole story.

I did not retrain the entire model.

The original Qwen weights stayed frozen. LoRA inserted small trainable adapter weights into selected parts of the model. QLoRA made the setup fit on a modest GPU by loading the base model in 4-bit.

The simple version:

Quantized base model + trainable LoRA adapter = QLoRA fine-tuning

The base model carried the general language and reasoning capability. The adapter learned the routing behavior.

This is one of the first concepts that clicked for me. The saved artifact was not a full standalone model. It was a behavior patch that gets loaded with the base model.

In this case:

Qwen3 4B Instruct + trusted_innovator_router_lora_v1 = fine-tuned router behavior

Training looked healthy

The training run completed in 30 steps across 5 epochs.

Loss moved in the right direction:

Step 10: train loss 1.58 / validation loss 1.46
Step 20: train loss 0.51 / validation loss 0.47
Step 30: train loss 0.24 / validation loss 0.27

That meant the adapter was getting better at predicting the target outputs token by token.

The model was not being graded on whether the final JSON “felt right.” It was being trained through next-token prediction.

Given the correct prefix, how likely was the model to predict the next correct token?

For example, if the correct output contained:

"recommended_platform": "Glean"

the training loop pushed the adapter to make "Glean" more likely in similar contexts.

If the correct output contained:

"needs_review": true

the adapter was nudged to make true more likely when the request involved SQL, governed data, dashboards, integrations, automation, or writeback.

That is what loss was measuring: the model’s token-level error against the target examples.

The loss curves suggested the adapter learned the small training distribution. The next question was whether the model could generate the right answer when used normally.

That is where things broke.

Generation broke

After training, I tested the model against the same kind of request.

Instead of returning JSON, the model generated:

assistant
assistant
assistant
assistant
assistant

This was probably not the model becoming mysterious at inference time. The most likely issue was more mechanical.

The model was looping on role markers, which usually points to something in the chat template, prompt assembly, EOS behavior, stop-token setup, or generation config. The model may have been seeing a slightly wrong version of the conversation format during inference.

That matters because fine-tuned chat models are sensitive to the structure around the prompt. The model was trained to complete examples in one format. If inference assembles the conversation differently, the first few generated tokens can go sideways fast.

This is where fine-tuning became a systems problem.

The adapter had learned useful behavior. I could see that from the loss and from some partial successful generations. The generation path was unstable. It started down the wrong output path and kept going.

That exposed an important distinction:

Training changes weights. Inference determines how those weights are expressed.

The model had learned the routing pattern. The runtime context was letting it start the response incorrectly. Once the first generated token went toward a role label instead of JSON, the rest of the output followed that path.

The fix was one character

The expected output was always a JSON object.

Every valid response began with:

So I changed the inference wrapper to prefill the assistant response with an opening curly bracket.

Instead of asking the model to begin from:

assistant:

I made the prompt effectively begin the assistant response as:

assistant:
{

Then the model only had to continue the JSON object.

That tiny change shifted the next-token problem.

The model was no longer deciding whether to start with a role marker, prose, markdown, or JSON. It was continuing an object that had already started.

The same SQL/dashboard request then produced:

{
  "tier": "Trusted Technical Analyst",
  "risk": "medium",
  "recommended_platform": "Databricks + governed dashboarding",
  "needs_review": true,
  "reason": "The request requires SQL and governed dashboarding over business data."
}

That was the moment the full loop became visible.

The fine-tune had taught the model the behavior. The inference wrapper made that behavior easier to express reliably.

The curly bracket did not fix the root cause. It bypassed the broken decision point.

The curly bracket was not magic. It was not a substitute for validation. It was a lightweight inference constraint.

Because the output contract required a JSON object, starting the assistant response with { moved the model onto the correct generation path. In a production system, I would pair this with schema validation, constrained decoding, or a structured-output interface. For the experiment, it was enough to separate two questions:

Did the adapter learn the routing behavior?
Could the inference wrapper make that behavior show up cleanly?

The answer to both was yes.

Why the curly bracket mattered

LLMs generate one token at a time.

The first generated token has a huge influence on the path that follows.

If the model starts with:

assistant

it may continue producing role-like tokens.

If the model starts with:

the next likely tokens become things like:

"tier"
"risk"
"recommended_platform"
"needs_review"
"reason"

The curly bracket did not change the weights. It changed the context.

That is a practical version of a broader production pattern: structured output tasks need more than a natural-language instruction that says “return JSON.”

They often need schemas, constrained decoding, output prefill, parsers, validators, retry loops, or some combination of those pieces.

The model is one part of the system. The output path is another.

The eval

Once the JSON prefill was in place, I reran the original baseline examples.

The manual regression set passed cleanly:

JSON valid: 5/5
Tier correct: 5/5
Risk correct: 5/5
Platform correct: 5/5
Review flag correct: 5/5

The original SQL/dashboard failure was fixed.

Before fine-tuning, the base model leaned generic:

{
  "recommended_platform": "Databricks + Tableau",
  "needs_review": false
}

After fine-tuning and the inference fix:

{
  "recommended_platform": "Databricks + governed dashboarding",
  "needs_review": true
}

Then I ran the held-out test set.

Result:

{
  "json_valid": 4,
  "tier_correct": 4,
  "risk_correct": 4,
  "platform_correct": 3,
  "needs_review_correct": 4,
  "total": 4
}

At a field level, that was 19 correct checks out of 20.

The one miss was the most useful part of the eval.

The model learned a bias

The failed test example was:

Finance wants an agent that reviews invoices, flags anomalies, and updates vendor records after approval.

Expected platform:

"recommended_platform": "Databricks app"

Actual platform:

"recommended_platform": "Databricks app + Salesforce integration"

The model correctly identified:

{
  "tier": "Trusted Technical Builder",
  "risk": "high",
  "needs_review": true
}

But it added Salesforce even though Salesforce was never mentioned.

That told me something specific about the dataset.

The training examples had taught the model that writeback and business-record updates often meant Salesforce. The adapter picked up that pattern and over-applied it.

That is a useful failure because the fix is obvious:

Add counterexamples.

The v2 dataset needs more examples like:

finance workflow with vendor records, no Salesforce
procurement workflow with approval records, no Salesforce
HR workflow with employee systems, no Salesforce
explicit Salesforce examples where Salesforce is mentioned
explicit ServiceNow examples where ServiceNow is mentioned

The rule I want the model to learn in v2:

Only choose Salesforce when Salesforce is present or clearly implied.
Only choose ServiceNow when ServiceNow is present or clearly implied.
Otherwise, use Databricks app for custom app or agent workflows.

That is the fine-tuning loop in practice.

The eval tells you what the model overlearned. The next dataset fixes that specific behavior.

What changed for me

This experiment changed how I think about fine-tuning.

Fine-tuning is not one action. It is a loop:

define the behavior
build the dataset
train the adapter
run inference
parse and validate the output
evaluate field by field
diagnose the failure
improve the dataset
retrain

The training run is only one part of that loop.

The adapter learned the routing behavior, but inference still needed structure. The eval passed most checks, but revealed a dataset bias. The dataset was not only a collection of examples. It was the product spec the model learned from.

The biggest lesson:

Model behavior comes from both weights and runtime context.

The fine-tune changed the weights through the adapter. The curly bracket changed the runtime context. The parser and eval harness told me whether the output was actually usable.

That is the real work around fine-tuning.

The model does not become reliable because the loss went down. It becomes reliable when the training data, inference path, output constraints, and eval loop all line up.

This is the first field note.

Next, I want to fix the dataset bias and run v2.

Cburt22

Jun 15

Awesome read. I love the focus on using fine-tuning for narrow, structured alignment rather than trying to stuff net-new facts into the weights. Using an open model to enforce a strict, governed operating model (like routing a Glean-only task to a Functional Builder vs. sending a Databricks/Salesforce pipeline to a Technical Builder) is exactly how enterprise AI becomes scalable and cost-effective.

The JSON schema consistency failure you pointed out with the base model is exactly why few-shot prompting hits a wall in production. Can't wait to see the details on how you solved the inference formatting loop in the next post!

Another Coding Blog

Discussion about this post

Ready for more?