Fine-Tuning Field Notes #2: What We Actually Trained with QLoRA

The run used Qwen3 4B, a 32-example dataset, and QLoRA to train 33M adapter parameters while leaving the 4.05B-parameter base model frozen.

Taylor Ortiz

Jun 17, 2026

The phrase “fine-tuned a model” hides a lot.

It sounds like the whole model changed.

In this experiment, that is not what happened.

The base model had about 4.05 billion parameters. The training run updated about 33 million trainable parameters.

That worked out to roughly 0.81% of the model.

That number helped the whole process click for me.

We loaded a capable open-weight model, froze the original weights, attached a small set of trainable adapter weights, and trained those adapters on a narrow routing task.

The base model carried the general language capability.

The adapter learned the task behavior.

The setup

The experiment used:

Base model: Qwen3 4B Instruct
Method: QLoRA
Dataset: 32 examples
Split: 24 train / 4 validation / 4 test
Trainable parameters: 33,030,144
Total parameters: 4,055,498,240
Percentage trained: 0.81%

The task was to classify internal AI requests into a governed routing schema.

Given a request like:

A sales analyst wants to query curated opportunity data with SQL and build a dashboard for regional pipeline trends.

the model needed to return a structured decision:

{
  "tier": "Trusted Technical Analyst",
  "risk": "medium",
  "recommended_platform": "Databricks + governed dashboarding",
  "needs_review": true,
  "reason": "The request requires SQL and governed dashboarding over business data."
}

The base model was already close. It understood the words in the request. It understood SQL, dashboards, agents, Salesforce, ServiceNow, and support workflows.

The thing I wanted to change was the model’s operating behavior.

Use these tier names. Use these platform labels. Set needs_review to true when governed data or production writeback appears. Return the response as strict JSON.

That is where QLoRA came in.
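The "strict JSON" requirement is easy to check mechanically. Here is a minimal sketch of a validator, using only the field names and tier names that appear in this post. Treating `reason` as optional is an assumption (one later example omits it), and the real schema may well include more tiers or fields than are shown here.

```python
import json

# Field names are taken from the example decision above.
REQUIRED_FIELDS = {
    "tier": str,
    "risk": str,
    "recommended_platform": str,
    "needs_review": bool,
}
# Assumption: "reason" is optional, because one later example leaves it out.
OPTIONAL_FIELDS = {"reason": str}
# Only the tier names mentioned in this post; the full list may be longer.
KNOWN_TIERS = {
    "Trusted Functional Builder",
    "Trusted Technical Analyst",
    "Trusted Technical Builder",
}


def validate_decision(raw):
    """Return a list of problems; an empty list means the output passed."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(decision, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in decision:
            problems.append(f"missing field: {field}")
        elif not isinstance(decision[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    for field, expected_type in OPTIONAL_FIELDS.items():
        if field in decision and not isinstance(decision[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    unexpected = sorted(set(decision) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS))
    if unexpected:
        problems.append(f"unexpected fields: {unexpected}")

    tier = decision.get("tier")
    if isinstance(tier, str) and tier not in KNOWN_TIERS:
        problems.append(f"unknown tier: {tier}")
    return problems
```

Running every model response through a check like this turns "mostly right" outputs into countable failures, which makes the before/after comparison much easier.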

The base model

The base model was Qwen3 4B Instruct.

That model already had general language capability. It could read the request, understand the business context, and produce a reasonable answer.

Before training, though, “reasonable” was not enough.

For a SQL/dashboard request, the base model returned something like:

{
  "recommended_platform": "Databricks + Tableau",
  "needs_review": false
}

That answer made sense generically. It did not match the routing policy I wanted.

The desired behavior was:

{
  "recommended_platform": "Databricks + governed dashboarding",
  "needs_review": true
}

This is the type of gap where fine-tuning can be useful.

The model does not need new facts. It needs a more specific response pattern.
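The gap between the two outputs above can be made explicit with a field-by-field comparison. This is a small sketch using only the two snippets shown in this section; `field_mismatches` is a hypothetical helper name, not part of any library.

```python
def field_mismatches(predicted, expected):
    """Map each expected field to (predicted value, expected value) where they differ."""
    return {
        key: (predicted.get(key), want)
        for key, want in expected.items()
        if predicted.get(key) != want
    }


# The base model's output and the desired output from this section.
base_output = {
    "recommended_platform": "Databricks + Tableau",
    "needs_review": False,
}
desired = {
    "recommended_platform": "Databricks + governed dashboarding",
    "needs_review": True,
}

mismatches = field_mismatches(base_output, desired)
for field, (got, want) in mismatches.items():
    print(f"{field}: got {got!r}, wanted {want!r}")
```

Both fields differ, and neither difference is a factual mistake. They are policy mismatches, which is the kind of gap an adapter can close.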

Full fine-tuning

Full fine-tuning updates the original model weights.

If you fully fine-tune a 4-billion-parameter model, every one of those original parameters can move during training.

That gives the training process a lot of flexibility. It also requires more memory, more compute, and more care.

Training has to hold the model weights, gradients, optimizer states, and the activations for each batch in memory. Those demands stack up quickly.

A 4B model may sound small compared to frontier models, but full training is still heavy.
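To put a rough number on "heavy," here is a back-of-the-envelope estimate. It assumes a common mixed-precision Adam setup at about 16 bytes per parameter, and it ignores activations entirely. These are textbook assumptions, not measurements from this run.

```python
# Assumed per-parameter training state for mixed-precision Adam.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def full_finetune_gb(num_params):
    """Memory for weights, gradients, and optimizer state, in GB (no activations)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


estimate = full_finetune_gb(4e9)  # "4 billion parameters", rounded
print(f"about {estimate:.0f} GB before activations")
```

Under those assumptions the estimate is about 64 GB, several times the 16 GB on a Tesla T4, before a single activation is stored.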

For this experiment, full fine-tuning would have been the wrong starting point.

The goal was to learn the end-to-end loop: adapt behavior, run evals, and iterate quickly.

LoRA

LoRA stands for Low-Rank Adaptation.

The important idea is simple:

Freeze the base model. Add small trainable matrices inside selected layers. Train those matrices.

The original model weights stay in place.

The adapter learns a correction.

A simplified version looks like this:

original layer output + LoRA update = adapted layer output

The model still uses the original Qwen weights. The LoRA adapter nudges some of the internal transformations.
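The "original output + LoRA update" line can be sketched in a few lines of plain Python. The sizes here are toy values chosen for readability, and `alpha` (the usual LoRA scaling factor) is an assumed value, since the post does not state one. The adapter path is two small matrices, `A` and `B`, applied in sequence.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    base = matvec(W, x)                 # frozen path: W x
    low_rank = matvec(B, matvec(A, x))  # adapter path: B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 4  # toy sizes; alpha is an assumption

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, starts at zero
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter contributes nothing, so training starts
# from exactly the base model's behavior.
y_init = lora_forward(x, W, A, B, alpha, r)

# Changing one adapter weight shifts the output while W stays untouched.
B[0][0] = 1.0
y_after = lora_forward(x, W, A, B, alpha, r)
```

The adapter holds `r * (d_in + d_out)` numbers instead of `d_in * d_out`, which is where the size savings come from once the layers are thousands of units wide.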

This is why the saved artifact is much smaller than the full model.

The folder I saved was called:

trusted_innovator_router_lora_v1

That folder is the trained adapter.

It is not a standalone 4B model. It has to be loaded with the base model to produce the fine-tuned behavior.

In practice, the model you actually run is:

Qwen3 4B Instruct + trusted_innovator_router_lora_v1

That distinction matters.

If someone says they “fine-tuned a model with LoRA,” the artifact they trained may be an adapter, not a full copy of the model.

Where the adapter goes

LoRA adapters are inserted into selected modules inside the transformer.

For this run, the target modules were:

q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj

The first group belongs to attention:

q_proj
k_proj
v_proj
o_proj

Attention helps the model decide which tokens matter to each other.

In this routing task, the model needs to connect clues like:

SQL
dashboard
curated data
Salesforce
ServiceNow
writeback
Glean

to the right output fields.

The second group belongs to the feed-forward part of the transformer, often called the MLP:

gate_proj
up_proj
down_proj

MLP stands for multi-layer perceptron.

In older neural network language, an MLP is basically a stack of fully connected layers. In a transformer, the MLP is the part of each block that works on the representation after attention has mixed information across tokens.

A simple way to think about it:

attention decides what information should be connected
MLP transforms that information into useful internal features

If attention helps the model notice that a request mentions SQL, dashboards, Glean, Salesforce, or writeback, the MLP helps turn those signals into higher-level behavior.

For this task, that behavior looks like:

Glean-only request → Trusted Functional Builder
SQL/dashboarding → Trusted Technical Analyst
Salesforce writeback → Trusted Technical Builder
governed data → needs_review true

Modern Llama/Qwen-style models use MLP projections with names like gate_proj, up_proj, and down_proj.

A rough mental model:

up_proj expands the representation to a wider hidden size
gate_proj produces a gate of the same width that controls how much of each expanded feature passes through
down_proj compresses the result back to the model dimension

That is not the full math, but it is enough to understand why LoRA adapters are often added there.
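For readers who do want the math, here is a minimal sketch of the gated MLP used in Llama/Qwen-style blocks, assuming the usual SiLU activation and leaving out biases, normalization, and the residual connection. The matrix names mirror the module names, but the sizes in any call are up to you.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def gated_mlp(x, W_gate, W_up, W_down):
    # W_gate and W_up map d_model -> d_ff; W_down maps d_ff -> d_model.
    gate = [silu(g) for g in matvec(W_gate, x)]  # how open each hidden unit is
    up = matvec(W_up, x)                         # expanded representation
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise gating
    return matvec(W_down, hidden)                # back to the model dimension
```

All three projections are plain linear layers, which is why each one can take a LoRA adapter in the same way the attention projections do.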

The adapter was inserted into both attention and MLP modules because the task required more than noticing keywords. The model needed to map those clues into a specific schema and policy.

Attention helped with the clues.

The MLP helped with the transformation from clues to routing behavior.

Rank

The LoRA config used:

r = 16

The r value is the LoRA rank.

A higher rank gives the adapter more capacity. It trains more parameters and can learn more complex changes. It also uses more memory and can overfit more easily on a tiny dataset.

A lower rank gives the adapter less capacity. It is smaller and faster, but may not learn enough.

For this first run, r = 16 was a reasonable starting point.

One thing I had to separate in my head:

target_modules controls where LoRA is inserted.
r controls how much capacity each adapter has.

Those are different choices.

The target modules decide which parts of the model get trainable adapters.

The rank decides the size of those trainable updates.
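The two choices combine into a simple count: each adapted linear layer adds `r * (in_features + out_features)` trainable parameters. The sketch below reproduces the trainable-parameter figure from the setup. The layer dimensions are assumptions taken from the published Qwen3-4B configuration, not from this post.

```python
# Assumed Qwen3-4B dimensions (from the published model config, not this post).
HIDDEN_SIZE = 2560
INTERMEDIATE_SIZE = 9728
NUM_LAYERS = 36
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
RANK = 16  # r from the LoRA config above

Q_OUT = NUM_ATTENTION_HEADS * HEAD_DIM  # 4096
KV_OUT = NUM_KV_HEADS * HEAD_DIM        # 1024

# (input features, output features) for each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN_SIZE, Q_OUT),
    "k_proj": (HIDDEN_SIZE, KV_OUT),
    "v_proj": (HIDDEN_SIZE, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN_SIZE),
    "gate_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "up_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "down_proj": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}


def lora_param_count(in_features, out_features, rank):
    # A is (rank x in_features) and B is (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, RANK) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS

print(f"{trainable:,} trainable parameters")
print(f"{100 * trainable / 4_055_498_240:.2f}% of the reported total")
```

Under those assumed dimensions the count comes out to 33,030,144, matching the training summary. Subtracting it from the reported total leaves 4,022,468,096, which suggests the total counts the adapter weights alongside the base model. The count is linear in `r`, so doubling the rank would double the trainable parameters, and dropping the three MLP modules would remove well over half of them.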

Quantization

The “Q” in QLoRA comes from quantization.

Quantization stores the base model weights in a smaller numeric format.

Instead of loading the base model in 16-bit or 32-bit precision, we loaded it in 4-bit.

That was this setting:

load_in_4bit = True

The model weights are still numbers. Quantization stores those numbers with less precision so the model uses less memory.
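A toy example makes "less precision" concrete. This sketch uses simple symmetric absmax rounding to 16 levels, which is an illustrative stand-in: real 4-bit loaders use more careful block-wise schemes such as NF4, and the weights below are made-up values.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] plus one shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.90, -0.07, 0.31]  # made-up example values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:+d} -> {r:+.3f}")
```

Each weight is now one of 16 possible values times a shared scale. The restored numbers are close to the originals but not identical, and that rounding error is the precision being traded away.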

A rough way to think about it:

FP32: 4 bytes per weight, most precision, most memory
FP16/BF16: 2 bytes per weight, common for GPU work
8-bit: 1 byte per weight
4-bit: half a byte per weight

The tradeoff is precision.

The benefit is memory.
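The memory side of the tradeoff is simple arithmetic. This sketch estimates weight storage alone for a model of roughly 4 billion parameters. It ignores the small overhead of quantization constants, and it ignores activations and the adapter.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16/BF16": 16, "8-bit": 8, "4-bit": 4}


def weight_gb(num_params, bits):
    """Storage for the weights alone, in GB."""
    return num_params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>9}: {weight_gb(4e9, bits):4.0f} GB")
```

At 16-bit the weights alone take about 8 GB, which leaves little headroom on a 16 GB GPU. At 4-bit they take about 2 GB, which leaves room for the adapter, its gradients and optimizer state, and activations.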

For this experiment, 4-bit loading made it practical to run Qwen3 4B on a Tesla T4 and train the adapter.

The base model was compressed. The LoRA adapter was the small trainable part.

That is the QLoRA pattern:

4-bit base model + trainable LoRA adapter
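In code, that pattern might look like the sketch below. It assumes the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries, which may not be the tooling used for this run. The rank and target modules come from this post. The quantization type, compute dtype, `lora_alpha`, and dropout are assumed values, and `base_model_id` is left as a parameter.

```python
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # attention
    "gate_proj", "up_proj", "down_proj",     # MLP
]


def build_qlora_model(base_model_id):
    """Load a 4-bit base model and attach trainable LoRA adapters."""
    # Imported here so the sketch can be read without the libraries installed.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumption
        bnb_4bit_compute_dtype=torch.float16,  # assumption; the T4 lacks bf16
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    base = prepare_model_for_kbit_training(base)

    lora_config = LoraConfig(
        r=16,                           # from this post
        lora_alpha=16,                  # assumption
        lora_dropout=0.0,               # assumption
        target_modules=TARGET_MODULES,  # from this post
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base, lora_config)
```

The frozen base is stored in 4-bit, while the adapter weights stay in higher precision so that gradient updates remain meaningful.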

What trained

During training, the base model stayed frozen.

The adapter weights moved.

That is why the training summary mattered:

Trainable parameters = 33,030,144 of 4,055,498,240
0.81% trained

The optimizer was not updating every Qwen parameter. It was updating the LoRA adapter weights attached to selected layers.

The dataset showed the model examples like:

Marketing wants a Glean agent that summarizes campaign briefs and drafts LinkedIn posts.

with a target output like:

{
  "tier": "Trusted Functional Builder",
  "risk": "low",
  "recommended_platform": "Glean",
  "needs_review": false
}

The training process compared the model’s predicted next tokens to the target tokens.

If the target said:

Glean

and the model assigned more probability to:

no-code agent builder

the adapter got nudged so that Glean became more likely in similar contexts.

If the target said:

needs_review: true

for a SQL/dashboard request, the adapter got nudged toward that pattern.

Over many token-level corrections, the adapter learned the routing behavior.
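One of those corrections can be sketched with a toy three-option "vocabulary." The logits are made-up values, and each label is treated as a single token for simplicity. In the real run the gradient flows back into the adapter weights rather than being applied to the logits directly, so this only shows the direction of the nudge.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Loss is low when the target token gets high probability."""
    return -math.log(softmax(logits)[target_index])


vocab = ["Glean", "no-code agent builder", "Databricks"]  # toy options
logits = [1.0, 2.0, 0.5]                                  # made-up scores
target = vocab.index("Glean")

probs = softmax(logits)
# Gradient of cross-entropy with respect to the logits: probs - one_hot(target).
grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

learning_rate = 0.5  # arbitrary value for the illustration
new_logits = [z - learning_rate * g for z, g in zip(logits, grad)]
new_probs = softmax(new_logits)

print(f"P(Glean) before: {probs[target]:.3f}, after: {new_probs[target]:.3f}")
```

One step raises the probability of the target and lowers the others. Repeated across every target token in every example, these small steps are the whole training signal.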

What did not train

The model did not learn language from scratch.

It did not learn what SQL is from my dataset.

It did not learn what Salesforce is from my dataset.

It did not memorize an enterprise knowledge base.

The base model already brought general capability. The adapter learned how I wanted that capability expressed for a narrow task.

That distinction is important because fine-tuning is often described too broadly.

For this experiment, the fine-tune was about response behavior:

exact tier names
exact platform names
review policy
structured JSON
routing consistency

That made it a good fit for adapter training.

If the task had been “answer questions from current internal documentation,” I would have reached for retrieval first.

Why this matters

The 0.81% number changed how I think about model adaptation.

It made the process feel less like rewriting a model and more like attaching a small policy-specific behavior layer.

That is powerful.

It also creates some practical constraints.

The adapter can shift behavior, but it is still sitting on top of the base model. If the dataset has gaps, the adapter will learn those gaps. If the examples over-associate writeback with Salesforce, the adapter may do the same. If the inference template is wrong, the learned behavior may not show up cleanly.

That showed up later in the eval.

The model correctly learned the broad routing policy, but it over-associated business-record writeback with Salesforce in one held-out test case.

That failure was not random. It reflected the data.

The adapter learned the examples. The next version needs better examples.

The mental model I’m keeping

After this run, this is how I think about QLoRA:

Base model: general capability
Quantization: makes the base model fit in memory
LoRA adapter: small trainable behavior patch
Dataset: product spec
Training: token-level updates to the adapter
Eval: tells you what the adapter actually learned

That is a much clearer picture than “I fine-tuned a model.”

The model had 4.05 billion parameters.

We trained about 33 million.

That was enough to change the routing behavior, expose the importance of the dataset, and create a useful v1 adapter.

Next, I want to look more closely at the dataset itself.

The adapter learned the product spec we gave it.

It also learned the bias inside that spec.

Another Coding Blog
