Many of us use DeepSeek-R1 or Alibaba's newly released Qwen3 in daily work. These models are very powerful, and the API services behind them can cover most personal or company development needs. Still, it is worth asking yourself a few simple questions:
- Is the company's data sensitive? Does it need to stay confidential?
- Is the task you give the large model actually difficult? Do you really need a reasoning chain?
- How much concurrency does the task put on the large model API, and how much does it cost per day?
For question 1: if the company's data is sensitive, I suggest not calling a vendor's large-model API. Even if the vendor guarantees that your data will not be used for training, the data still leaves your control, which is an unnecessary risk. In that case, deploy the model locally.
For question 2: if the problem is genuinely difficult and a reasoning chain is a hard requirement, then use the vendor's API; that way a long reasoning context will not exhaust your own GPU memory. If the problem is simple and there is no hard requirement for a reasoning chain, deploy a small model locally.
For question 3: if the task is simple but the API is called with high concurrency, I suggest fine-tuning a small model for that specific task and deploying it locally. This handles the concurrency and cuts the cost. (Local deployment here assumes a single RTX 4090.)
By now you have probably weighed these three questions and have an answer in mind, so let me give a small example.
The need for a fine-tuned model
Suppose your company needs to extract user information from complaint texts: the user's name, address, email, and the complaint itself, as in the text below.
This is just a toy example, and the data was generated in batches with a large model. Real complaint data will not be this "clean and tidy".
INPUT:
Long Lin, Block G, Donglin Street, Lu City, Ningxia Hui Autonomous Region 955491, email nafan@example.com. The garbage in the community is piled up like a mountain, the noise disturbs people's sleep at night, and parking is even more difficult. It is simply unbearable!
OUTPUT:
{
"name": "Long Lin",
"address": "955491, Block G, Chengdonglin Street, Lu City, Ningxia Hui Autonomous Region",
"email": "nafan@example.com",
"question": "Garbage piles up in the community, noise disturbs people's sleep at night, parking is difficult, it's unbearable!"
}
Of course, you could call DeepSeek's strongest model, R1, or Alibaba's latest flagship, Qwen3-235B-A22B; their extraction quality is excellent.
But there is a problem: if you have millions of such records to process, routing everything through the latest and best large models can cost tens of thousands of yuan. Moreover, complaint data such as telecom or power-grid complaints is sensitive and cannot simply be sent out to an external network.
So, weighing data sensitivity against cost, the best option is to fine-tune a small model (such as Qwen3-0.6B). It gives you high concurrency, keeps the data in-house, extracts the information well, and saves money!
Next, let's walk through a small hands-on case: fine-tuning the Qwen3-0.6B small model to complete this text information extraction task.
Configure the environment and download the data
Colab file address: https://colab.research.google.com/drive/18ByY11KVhIy6zWx1uKUjSzqeHTme-TtU?usp=drive_link
!pip install datasets swanlab -q
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1a0sf5C209CLW5824TJkUM4olMy0zZWpg' -O fake_sft.json
Process the data
from datasets import Dataset
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer, GenerationConfig
from peft import LoraConfig, TaskType, get_peft_model
import torch
# Load the JSON file into a pandas DataFrame, then build a Hugging Face Dataset from it
df = pd.read_json('fake_sft.json')
ds = Dataset.from_pandas(df)
ds[:3]
model_id = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
tokenizer
The data format for supervised fine-tuning (SFT) of a large language model is as follows:
{
"instruction": "Answer the following user question and only output the answer.",
"input": "What is 1+1?",
"output": "2"
}
Here, instruction is the user instruction, telling the model what task to complete; input is the user input, i.e. the content required to carry out the instruction; and output is the answer the model should give.
The goal of supervised fine-tuning is to make the model understand and follow user instructions, so the dataset should be built specifically for the target task. For example, if the goal is to fine-tune a model that role-plays Zhen Huan's dialogue style from a large amount of character dialogue, a data example for that scenario looks like this:
{
"instruction": "Who is your father?",
"input": "",
"output": "My father is Zhen Yuandao, the Shaoqing of Dali Temple."
}
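Our complaint-extraction task maps onto the same format. Below is a minimal sketch of how one training record could be assembled; the field layout mirrors what process_func below expects (system, instruction, input, output), but exactly which field carries the extraction instruction in fake_sft.json is an assumption here.
# Hypothetical training record for the extraction task (structure assumed, not taken from fake_sft.json)
record = {
    "system": "Extract the name, address, email and question from the text and output them as JSON.",
    "instruction": "",
    "input": "Long Lin, Block G, Donglin Street, Lu City, Ningxia Hui Autonomous Region 955491, email nafan@example.com. The garbage in the community is piled up like a mountain...",
    "output": "{\"name\": \"Long Lin\", \"address\": \"...\", \"email\": \"nafan@example.com\", \"question\": \"...\"}"
}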
The chat template used by Qwen3 is shown below. Since Qwen3 is a hybrid reasoning model, you can manually choose whether to enable thinking mode.
With thinking mode disabled:
messages = [
    {"role": "system", "content": "You are a helpful AI"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm fine, thank you. And you?"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
print(text)
<|im_start|>system
You are a helpful AI<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
</think>
I'm fine, thank you. And you?<|im_end|>
<|im_start|>assistant
<think>
</think>
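For comparison, here is a minimal sketch of the same call with thinking mode enabled; with enable_thinking=True the template does not pre-fill the empty <think></think> pair, so the model is free to generate its own reasoning block before answering.
# Same messages, but with thinking mode enabled
text_thinking = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
print(text_thinking)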
The data for LoRA (Low-Rank Adaptation) training needs to be formatted and encoded before it is fed to the model. We encode the input text as input_ids and the output text as labels; the encoded result is a sequence of token ids. We first define a preprocessing function that encodes the input and output text of each sample and returns the encoded dictionary:
def process_func(example):
    MAX_LENGTH = 1024  # Set the maximum sequence length to 1024 tokens
    input_ids, attention_mask, labels = [], [], []  # Initialize the return values
    # Adapt the chat template
    instruction = tokenizer(
        f"<s><|im_start|>system\n{example['system']}<|im_end|>\n"
        f"<|im_start|>user\n{example['instruction'] + example['input']}<|im_end|>\n"
        f"<|im_start|>assistant\n<think>\n\n</think>\n\n",
        add_special_tokens=False
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    # Concatenate the input_ids of the instruction part and the response part, and add the eos token at the end to mark the end of the sequence
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    # Attention mask, indicating the positions the model needs to pay attention to
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    # For the instruction part, use -100 so that no loss is calculated at these positions (the model does not need to predict this part)
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # Truncate beyond the maximum sequence length
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
tokenized_id = ds.map(process_func, remove_columns=ds.column_names)
tokenized_id
tokenizer.decode(tokenized_id[0]['input_ids'])
tokenizer.decode(list(filter(lambda x: x != -100, tokenized_id[1]["labels"])))
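As an optional sanity check (not in the original notebook), you can confirm that the entire prompt part of labels is masked with -100, so the loss is only computed on the response:
sample = tokenized_id[0]
num_masked = sum(1 for x in sample["labels"] if x == -100)  # masked positions = prompt length
print(f"{num_masked} of {len(sample['labels'])} label positions are masked (the prompt part)")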
Load model
Load the model and configure LoraConfig
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
model
Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((1024,), eps=1e-06)
    (rotary_emb): Qwen3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=1024, out_features=151936, bias=False)
)
model.enable_input_require_grads()  # This method must be called when gradient checkpointing is enabled
Lora Config
Many parameters can be set in the LoraConfig class; the most important ones are:
- task_type: the model type. Almost all decoder-only models today are causal language models, i.e. CAUSAL_LM.
- target_modules: the names of the layers to train, mainly the attention (and MLP) projection layers; different models use different layer names.
- r: the LoRA rank, which determines the dimension of the low-rank matrices; a smaller r means fewer trainable parameters.
- lora_alpha: the scaling parameter, which together with r determines the strength of the LoRA update. The effective scaling factor is lora_alpha / r, which in this example is 32 / 8 = 4.
- lora_dropout: the dropout rate applied to the LoRA layers, used to prevent overfitting.
from peft import LoraConfig, TaskType, get_peft_model
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; together with r it sets the scaling factor, see the explanation above
    lora_dropout=0.1  # dropout rate
)
config
model = get_peft_model(model, config)
config
model.print_trainable_parameters()  # only 0.8395% of the parameters are trainable
trainable params: 5,046,272 || all params: 601,096,192 || trainable%: 0.8395
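As a quick cross-check (not part of the original notebook), the figure 5,046,272 follows directly from the layer shapes printed earlier: each targeted Linear layer of shape in_features x out_features gets two LoRA matrices, A (r x in_features) and B (out_features x r), i.e. r * (in + out) extra parameters:
r = 8
per_layer = sum(r * (i + o) for i, o in [
    (1024, 2048),  # q_proj
    (1024, 1024),  # k_proj
    (1024, 1024),  # v_proj
    (2048, 1024),  # o_proj
    (1024, 3072),  # gate_proj
    (1024, 3072),  # up_proj
    (3072, 1024),  # down_proj
])
print(per_layer * 28)  # 28 decoder layers -> 5,046,272, matching print_trainable_parameters()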
Training Arguments
- output_dir: the directory where checkpoints are written.
- per_device_train_batch_size: the batch size on each GPU.
- gradient_accumulation_steps: the number of gradient accumulation steps; with a per-device batch size of 4 and 4 accumulation steps, the effective batch size is 16.
- num_train_epochs: the number of training epochs, as the name suggests.
args = TrainingArguments(
    output_dir="Qwen3_instruct_lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=1,
    num_train_epochs=3,
    save_steps=50,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    report_to="none",
)
About SwanLab
SwanLab is an open-source experiment tracking tool for AI researchers. It provides training visualization, automatic logging, hyperparameter recording, experiment comparison, and multi-user collaboration. With SwanLab, researchers can spot training problems in intuitive charts, compare multiple experiments to find research inspiration, and break down team communication barriers through shareable online links and organization-based collaborative training.
Why record training?
Compared with software development, model training is closer to an experimental science. Behind a high-quality model there are usually thousands of experiments. Researchers have to keep trying, recording, and comparing in order to accumulate the experience needed to find the best model structure, hyperparameters, and data mix. Recording and comparing experiments efficiently is therefore crucial to research productivity.
Log in to SwanLab with an existing account and your private API Key.
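In a notebook you can log in programmatically; a minimal sketch (the API Key below is a placeholder, copy your own from the SwanLab settings page):
import swanlab
# Log in with your private API Key; alternatively run `swanlab login` in a terminal
swanlab.login(api_key="your-private-api-key")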
import swanlab
from swanlab.integration.transformers import SwanLabCallback
# Instantiate the SwanLabCallback (change the project and experiment names to your own)
swanlab_callback = SwanLabCallback(
    project="Qwen3-Lora",
    experiment_name="Qwen3-0.6B-extarct-lora-2"
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_id,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[swanlab_callback]
)
trainer.train()
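After training, you will usually want to persist the adapter so it can be reloaded for local deployment. A minimal sketch (the output directory name is arbitrary):
# Save only the LoRA adapter weights and the tokenizer, not the full base model
model.save_pretrained("Qwen3_instruct_lora/final_adapter")
tokenizer.save_pretrained("Qwen3_instruct_lora/final_adapter")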
Test the model
The test prompt below is the original Chinese version of the complaint example used earlier:
prompt = "龙琳 ,宁夏回族自治区璐市城东林街g座 955491,nafan@example.com。小区垃圾堆积成山,晚上噪音扰人清梦,停车难上加难,简直无法忍受!太插件了阿萨德看见啊啥的健康仨都会撒娇看到撒谎的、"
messages = [
{"role": "system", "content": "将文本中的name、address、email、question提取出来,以json格式输出,字段为name、address、email、question,值为文本中提取出来的内容。"},
{"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(messages,
add_generation_prompt=True,
tokenize=True,
return_tensors="pt",
return_dict=True,
enable_thinking=False).to('cuda')
gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
{
"name": "龙琳",
"address": "宁夏回族自治区璐市城东林街g座 955491",
"email": "nafan@example.com",
"question": "小区垃圾堆积成山,晚上噪音扰人清梦,停车难上加难,简直无法忍受!太插件了阿萨德看见啊啥的健康仨都会撒娇看到撒谎的、"
}