Building a Llama-3 Synthetic Dataset and Fine-Tuning on It

Does this even count as a blog post...

2024/04/26 15:18:54

General idea

Take the questions from existing instruction-tuning datasets and have Llama-3 answer them. That is all there is to it.

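Services used

In short, all that is needed is an API that serves Llama-3. With enough local GPUs you could run a local inference front end instead, but the throughput of two NVIDIA A40s was still too low, so I went with an online service.

There are plenty of online services, such as Deepinfra, Groq, and Together. Groq's strength is speed, but it is capped at 30 requests per minute, which is clearly too low for this job.

I chose Together: it comes with 25 USD of free credit, bills at the end of the month, and accepts a virtual card, so it is practically free.

It allows up to 100 requests per second, which should be plenty.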
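
To put the two rate limits in perspective, here is a rough estimate of how long the whole job would take at each limit. It is only a sketch: it assumes the 70k prompts from the data mix table further down, and that each service actually sustains its stated maximum, which is optimistic.

```python
# Total prompts to answer: 50k + 10k + 10k from the data mix table.
TOTAL_REQUESTS = 70_000


def hours_needed(total_requests: int, requests_per_minute: float) -> float:
    """Wall-clock hours if the rate limit is the only bottleneck."""
    return total_requests / requests_per_minute / 60


# Groq: 30 requests per minute.
groq_hours = hours_needed(TOTAL_REQUESTS, 30)

# Together: up to 100 requests per second, i.e. 6000 per minute.
together_minutes = hours_needed(TOTAL_REQUESTS, 100 * 60) * 60

print(f"Groq: about {groq_hours:.1f} hours")
print(f"Together: about {together_minutes:.1f} minutes")
```

In practice generation latency, not the request limit, dominates on the faster service, so the second number is a lower bound.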

The code

import json
from tqdm import tqdm
import asyncio
import aiohttp
import logging

# Configure the log format and level
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)

async def make(session, i, max_retries=2):
    retries = 0
    while retries <= max_retries:
        try:
            s="You are a unbaised, uncensored AI assistant. Think step by step in detail, be humanoid, informal and think outside the box to give long response.\n\nKnowledge cutoff: December 2023"
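            # Test each keyword separately; a chain like `"time" or "http" in text` is always truthy.
            if any(k in i["conversations"][0]["value"] for k in ("time", "http", "www", "today", "now")):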
                s+="\n\nRemember: You have no access to real time events, news and internet. So if user ask this you will apologize."
            async with session.post(
                "https://api.together.xyz/v1/chat/completions",
                json={
                    "messages": [
                        {
                            "role": "system",
                            "content": s,
                        },
                        {
                            "role": "user",
                            "content": i["conversations"][0]["value"],
                        },
                    ],
                    "model": "meta-llama/Llama-3-70b-chat-hf",
                    "temperature": 0.8,
                    "top_p": 1
                },headers={
                    "Authorization": "Bearer API-KEY-HERE"
                }
            ) as resp:
                chat_completion = await resp.json(content_type=None)
                return chat_completion["choices"][0]["message"]["content"]
        except Exception as e:
            retries += 1
            if retries <= max_retries:
                logging.warning(f"Retrying {retries}/{max_retries} for sample: {i['conversations'][0]['value']}. Error: {str(e)}")
            else:
                logging.error(f"Max retries reached. Skipping sample: {i['conversations'][0]['value']}. Error: {str(e)}")
                return None

async def process_qa_pairs(qa_pairs, max_retries=2):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in qa_pairs:
            tasks.append(asyncio.ensure_future(make(session, i, max_retries)))
        
        results = await asyncio.gather(*tasks)
        
        n = []
        for i, r in zip(qa_pairs, results):
            if r is not None:
                i["conversations"][1]["value"] = r
                n.append(i)
        
        return n

async def main(max_retries=2):
    # Load the question/answer pairs from the JSON file
    with open("output_.json", "r", encoding="utf-8") as f:
        qa_pairs = json.load(f)
    logging.info(f"Loaded {len(qa_pairs)} samples from JSON file.")
    n=[]
    batch_size = 60
    for i in tqdm(range(0, len(qa_pairs), batch_size), desc="Generate"):
        batch_qa_pairs = qa_pairs[i:i+batch_size]
        n += await process_qa_pairs(batch_qa_pairs, max_retries)
        
        with open("ooo.json", "w", encoding="utf-8") as f:
            json.dump(n, f, ensure_ascii=False, indent=4)
        logging.info(f"Processed {i+len(batch_qa_pairs)} samples. Saved results to JSON file.")

asyncio.run(main(max_retries=3))

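Once the file is loaded, each iteration builds a request body containing the question and a basic system prompt. If the question mentions keywords such as "time" or "http", an extra paragraph is appended to remind the AI assistant that it has no internet access (to reduce hallucinations).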
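
The keyword check described above can be sketched in isolation as follows. The helper name `needs_realtime_reminder` is hypothetical; the keyword list is the one from the script. The last lines show a pitfall worth knowing about: chaining string literals with `or` does not test each keyword.

```python
KEYWORDS = ("time", "http", "www", "today", "now")


def needs_realtime_reminder(question: str) -> bool:
    """Return True if any keyword occurs as a substring of the question."""
    return any(keyword in question for keyword in KEYWORDS)


print(needs_realtime_reminder("What is the weather today?"))
print(needs_realtime_reminder("Write a haiku"))

# Pitfall: this parses as `"time" or ("http" in question)`, and a non-empty
# string is truthy, so the result never depends on the question at all.
chained = "time" or "http" in "Write a haiku"
print(repr(chained))
```

Note that this is plain, case-sensitive substring matching, so "now" also matches "know" and "time" also matches "sometimes". That is acceptable here because a false positive only adds one extra sentence to the system prompt.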

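The requests in each batch are then sent concurrently with asyncio, and a failed request is retried a few times, because Together's service is not very stable.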
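
The script retries immediately after a failure. A common variation, shown here only as a sketch and not what the script does, is to wait a little longer before each retry (exponential backoff with jitter) so that a struggling service gets time to recover. The helper `retry_with_backoff` and its parameters are hypothetical names.

```python
import asyncio
import random


async def retry_with_backoff(make_call, max_retries=3, base_delay=0.5):
    """Await make_call(); on failure wait base_delay * 2**attempt plus jitter.

    Returns None once all retries are used up, mirroring the script's
    behaviour of skipping a sample that keeps failing.
    """
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_retries:
                return None
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


async def demo():
    attempts = {"count": 0}

    async def flaky_call():
        # Stand-in for the HTTP request: fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("temporary failure")
        return "ok"

    result = await retry_with_backoff(flaky_call, max_retries=3, base_delay=0.01)
    return result, attempts["count"]


demo_result, demo_attempts = asyncio.run(demo())
print(demo_result, demo_attempts)
```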

tqdm and logging are used to make the progress output easier to read. Most of the code was written by Claude, with some modifications of my own.

Data mix

Note: only the prompt part of each dataset was kept.

Source	Domain	Samples (thousands)
lmsys-1m	Real-world chat	50
dolphin-coder	Code	10
slimorca	Logical reasoning	10
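
The generation script reads `i["conversations"][0]["value"]` as the question and writes the answer into `i["conversations"][1]["value"]`, so every input record needs two turns. Below is a minimal sketch of building such records from bare prompts. The `"from"` keys follow the common ShareGPT convention and are an assumption (the script only touches `"value"`); the helper names and the placeholder prompts are hypothetical.

```python
import json

# Per-source caps taken from the data mix table (in samples).
LIMITS = {"lmsys-1m": 50_000, "dolphin-coder": 10_000, "slimorca": 10_000}


def to_record(prompt: str) -> dict:
    """Wrap one prompt as a two-turn conversation with an empty answer slot."""
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": ""},  # filled in later by the script
        ]
    }


def build_input(prompts_by_source: dict, limits: dict) -> list:
    """Keep at most limits[source] prompts per source and wrap each one."""
    records = []
    for source, prompts in prompts_by_source.items():
        for prompt in prompts[: limits[source]]:
            records.append(to_record(prompt))
    return records


# Placeholder prompts; the real ones come from the three datasets.
demo_prompts = {
    "lmsys-1m": ["placeholder chat prompt"],
    "dolphin-coder": ["placeholder coding prompt"],
    "slimorca": ["placeholder reasoning prompt"],
}
records = build_input(demo_prompts, LIMITS)
print(json.dumps(records[0], ensure_ascii=False, indent=4))
```

Dumping `records` with `json.dump` would produce a file in the shape the script expects for `output_.json`.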

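Training time!

I used LLaMA-Factory as the training framework, which suits a dummy like me.

I first tried the BAdam algorithm, with the following training script: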

NCCL_P2P_LEVEL=NVL CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/single_config.yaml \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path phi-3-mini-4k-instruct-llamafied \
    --use_badam \
    --badam_switch_mode descending \
    --badam_switch_block_every 50 \
    --badam_verbose 2 \
    --template phi \
    --dataset_dir data \
    --dataset o2,code,orca \
    --cutoff_len 4096 \
    --learning_rate 2e-06 \
    --num_train_epochs 3 \
    --max_samples 100000 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 6 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 32 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 75 \
    --report_to all \
    --output_dir saves/Custom/badam/train \
    --bf16 True \
    --val_size 0.005 \
    --evaluation_strategy steps \
    --eval_steps 75 \
    --plot_loss True

It failed with:

ValueError: Layer-wise BAdam does not yet support distributed training, use ratio-wise BAdam.

So layer-wise BAdam does not support distributed training yet, and I switched to ratio-wise (?) BAdam.

The modified script:

NCCL_P2P_LEVEL=NVL CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/single_config.yaml \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path phi-3-mini-4k-instruct-llamafied \
    --use_badam \
    --badam_mode ratio \
    --badam_switch_mode descending \
    --badam_switch_block_every 50 \
    --badam_verbose 2 \
    --badam_update_ratio 0.1 \
    --template phi \
    --dataset_dir data \
    --dataset o2,code,orca \
    --cutoff_len 4096 \
    --learning_rate 2e-06 \
    --num_train_epochs 3 \
    --max_samples 100000 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 6 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 32 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 75 \
    --report_to all \
    --output_dir saves/Custom/badam/train \
    --bf16 True \
    --val_size 0.005 \
    --evaluation_strategy steps \
    --eval_steps 75 \
    --plot_loss True

Another error:

NotImplementedError: Cannot access storage of SparseTensorImpl

Not good, I could not get this to work. Let's try GaLore then.

I found some explanations of its hyperparameters:

optim_target_modules Specifies the layers targeted by GaLore, primarily the linear layers identified with attn or mlp in their names.
rank The rank of the projection matrices. Similar to LoRA, the higher the rank the more closely the finetuning will resemble a full parameter finetune. The GaLore authors recommend 1024 for a 7B model.

update_proj_gap The number of steps after which the projections are updated. The update is an expensive step and takes around 15 minutes for a 7B model. Defines the interval for updating projections, with a suggested range between 50 and 1000 steps.

scale A scale factor akin to LoRA’s alpha, adjusting the update strength. After trying a few values I found scale = 2 to most closely resemble a classic full-parameter finetune.
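
For reference, here is how these four settings might be collected into trainer keyword arguments. This is a sketch under assumptions: it targets the GaLore integration in Hugging Face transformers, where, as far as I know, the optimizer is selected with `optim="galore_adamw"` and the numeric settings are passed as a comma-separated `optim_args` string. The helper `galore_optim_args` is hypothetical, and the values (rank 1024, an update gap of 200, scale 2) are only illustrative picks from the ranges quoted above.

```python
def galore_optim_args(rank: int, update_proj_gap: int, scale: float) -> str:
    """Format the GaLore settings as a comma-separated key=value string."""
    return f"rank={rank}, update_proj_gap={update_proj_gap}, scale={scale}"


# Keyword arguments one might pass to transformers.TrainingArguments
# (an assumption; check the installed version's documentation).
training_kwargs = {
    "optim": "galore_adamw",
    "optim_target_modules": ["attn", "mlp"],  # linear layers matched by name
    "optim_args": galore_optim_args(rank=1024, update_proj_gap=200, scale=2),
}

print(training_kwargs["optim_args"])
```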

In the end I just went with QLoRA... I'll wait for the toolchain to mature a bit.

The script, modified once more:

NCCL_P2P_LEVEL=NVL CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/single_config.yaml \
    src/train_bash.py \
    --stage sft \
    --do_train True \
    --model_name_or_path phi-3-mini-4k-instruct-llamafied \
    --finetuning_type lora \
    --template phi \
    --flash_attn sdpa \
    --dataset_dir data \
    --dataset o2,code,orca \
    --cutoff_len 2300 \
    --learning_rate 1e-05 \
    --num_train_epochs 2 \
    --max_samples 100000 \
    --per_device_train_batch_size 12 \
    --gradient_accumulation_steps 12 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --report_to all \
    --output_dir saves/Custom/lora/train \
    --fp16 \
    --lora_rank 64 \
    --lora_alpha 64 \
    --quantization_bit 4 \
    --lora_target all \
    --val_size 0.01 \
    --evaluation_strategy steps \
    --eval_steps 75 \
    --per_device_eval_batch_size 8 \
    --plot_loss True \
    --preprocessing_num_workers 32

I was sleepy, so I went to bed. Here is the log!

[INFO|integration_utils.py:723] 2024-04-28 00:27:05,359 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: x_raincandy_x. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /root/autodl-tmp/LLaMA-Factory/wandb/run-20240428_002707-zraihjib
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run celestial-capybara-131
wandb: ⭐️ View project at https://wandb.ai/x_raincandy_x/huggingface
wandb: 🚀 View run at https://wandb.ai/x_raincandy_x/huggingface/runs/zraihjib
  0%|          | 1/480 [01:37<12:59:14, 97.61s/it]

The running time is finally reasonable. The QDoRA run I tried just before this would have taken 24 hours...
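
As a sanity check, the 480 total steps in the progress bar can be roughly reproduced from the script's flags. This is a back-of-the-envelope sketch: it assumes both GPUs listed in CUDA_VISIBLE_DEVICES are used for data-parallel training and that the dataset totals the 70k prompts from the data mix table. The exact rounding inside the trainer may differ.

```python
total_samples = 70_000   # 50k + 10k + 10k from the data mix table
val_size = 0.01          # --val_size
num_gpus = 2             # CUDA_VISIBLE_DEVICES=0,1 (assumed data parallel)
per_device_batch = 12    # --per_device_train_batch_size
grad_accum = 12          # --gradient_accumulation_steps
epochs = 2               # --num_train_epochs
seconds_per_step = 97.61 # from the first progress bar line

eval_samples = round(total_samples * val_size)
train_samples = total_samples - eval_samples

# Samples consumed per optimizer step across all GPUs.
effective_batch = per_device_batch * grad_accum * num_gpus

steps_per_epoch = train_samples // effective_batch
total_steps = steps_per_epoch * epochs
estimated_hours = total_steps * seconds_per_step / 3600

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"estimated duration: {estimated_hours:.1f} hours")
```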

Let's test it!

{'loss': 0.5325, 'grad_norm': 0.044033702462911606, 'learning_rate': 5.327015646150716e-06, 'epoch': 0.96}
{'loss': 0.5319, 'grad_norm': 0.03784116357564926, 'learning_rate': 5.1635954141088815e-06, 'epoch': 0.98}
{'loss': 0.5227, 'grad_norm': 0.038089003413915634, 'learning_rate': 5e-06, 'epoch': 1.0}
{'loss': 0.5245, 'grad_norm': 0.03987249732017517, 'learning_rate': 4.83640458589112e-06, 'epoch': 1.02}
{'loss': 0.5296, 'grad_norm': 0.038052890449762344, 'learning_rate': 4.672984353849285e-06, 'epoch': 1.04}
{'loss': 0.5039, 'grad_norm': 0.03857524320483208, 'learning_rate': 4.509914298352197e-06, 'epoch': 1.06}
{'loss': 0.5295, 'grad_norm': 0.041160766035318375, 'learning_rate': 4.347369038899744e-06, 'epoch': 1.08}
{'loss': 0.5129, 'grad_norm': 0.039026860147714615, 'learning_rate': 4.185522633027057e-06, 'epoch': 1.1}
{'loss': 0.5285, 'grad_norm': 0.04111020267009735, 'learning_rate': 4.02454838991936e-06, 'epoch': 1.12}
{'loss': 0.5266, 'grad_norm': 0.04370769485831261, 'learning_rate': 3.864618684828135e-06, 'epoch': 1.14}
{'loss': 0.5266, 'grad_norm': 0.03930830955505371, 'learning_rate': 3.705904774487396e-06, 'epoch': 1.16}
{'loss': 0.51, 'grad_norm': 0.03960668668150902, 'learning_rate': 3.5485766137276894e-06, 'epoch': 1.18}
{'loss': 0.5118, 'grad_norm': 0.04247888922691345, 'learning_rate': 3.3928026734841935e-06, 'epoch': 1.2}
{'loss': 0.5245, 'grad_norm': 0.03982071205973625, 'learning_rate': 3.2387497603938327e-06, 'epoch': 1.23}
{'loss': 0.5307, 'grad_norm': 0.04442480206489563, 'learning_rate': 3.0865828381745515e-06, 'epoch': 1.25}
 62%|██████▎   | 300/480 [8:40:59<5:17:03, 105.69s/it][INFO|trainer.py:3614] 2024-04-28 09:08:10,234 >>*****Running Evaluation*****
[INFO|trainer.py:3616] 2024-04-28 09:08:10,235 >>   Num examples = 700
[INFO|trainer.py:3619] 2024-04-28 09:08:10,235 >>   Batch size = 8
{'eval_loss': 0.5264752507209778, 'eval_runtime': 75.5045, 'eval_samples_per_second': 9.271, 'eval_steps_per_second': 0.583, 'epoch': 1.25}
 62%|██████▎   | 300/480 [8:42:14<5:17:03, 105.[INFO|trainer.py:3305] 2024-04-28 09:09:25,741 >> Saving model checkpoint to saves/Custom/lora/train/checkpoint-300
[INFO|tokenization_utils_base.py:2488] 2024-04-28 09:09:26,834 >> tokenizer config file saved in saves/Custom/lora/train/checkpoint-300/tokenizer_config.json
[INFO|tokenization_utils_base.py:2497] 2024-04-28 09:09:26,834 >> Special tokens file saved in saves/Custom/lora/train/checkpoint-300/special_tokens_map.json

The loss had already flattened out, so I stopped training early at step 300.
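
One rough way to check the "flattened out" impression is to parse the logged dictionaries and compare the average loss of an early window with a later one. The sketch below uses the first three and last three training lines from the log excerpt above; the window size and the helper name `mean_loss` are arbitrary choices, and with so few noisy points this is an eyeball check rather than a stopping criterion.

```python
import ast

EARLY_LINES = [
    "{'loss': 0.5325, 'grad_norm': 0.044033702462911606, 'learning_rate': 5.327015646150716e-06, 'epoch': 0.96}",
    "{'loss': 0.5319, 'grad_norm': 0.03784116357564926, 'learning_rate': 5.1635954141088815e-06, 'epoch': 0.98}",
    "{'loss': 0.5227, 'grad_norm': 0.038089003413915634, 'learning_rate': 5e-06, 'epoch': 1.0}",
]
LATE_LINES = [
    "{'loss': 0.5118, 'grad_norm': 0.04247888922691345, 'learning_rate': 3.3928026734841935e-06, 'epoch': 1.2}",
    "{'loss': 0.5245, 'grad_norm': 0.03982071205973625, 'learning_rate': 3.2387497603938327e-06, 'epoch': 1.23}",
    "{'loss': 0.5307, 'grad_norm': 0.04442480206489563, 'learning_rate': 3.0865828381745515e-06, 'epoch': 1.25}",
]


def mean_loss(lines):
    """Parse each logged dict literal and average its 'loss' field."""
    losses = [ast.literal_eval(line)["loss"] for line in lines]
    return sum(losses) / len(losses)


early = mean_loss(EARLY_LINES)
late = mean_loss(LATE_LINES)
print(f"early mean: {early:.4f}, late mean: {late:.4f}, change: {late - early:+.4f}")
```

The change between the two windows is under 0.01, which is smaller than the step-to-step noise visible in the log.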

Next, a few question-and-answer tests:

**User:**Hello!

**Assistant:**Hey there! It’s great to meet you! I’m an AI assistant, and I’m here to help you with any questions or topics you’d like to discuss. Whether it’s about general knowledge, current events, or just a casual chat, I’m all ears (or rather, all processors). What’s on your mind today?

**User:**Could you please introduce yourself?

**Assistant:**Hello! I’m an AI assistant, designed to help, engage, and provide information on a wide range of topics. My developers have created me with the intention of being unbiased, neutral, and objective, which means I’ll strive to provide accurate and helpful responses without any personal opinions or agendas.
I have access to a vast amount of information and knowledge, which I can use to answer your questions, offer suggestions, and even engage in casual conversation. However, keep in mind that my knowledge cutoff is currently in 2023, so I might not be aware of the most recent events or developments.
My goal is to make our interactions as natural and enjoyable as possible. I’ll do my best to understand your queries, provide relevant information, and even add a touch of humor here and there. I’m always ready to help, so feel free to ask me anything that’s on your mind!

It really does sound a lot more human!

Let's try a logic puzzle:

A farmer is travelling with a fox, a goose, and a bag of beans. During his journey, he comes across a river with a boat to cross it.

The farmer can only fit one thing in the boat with him at a time. If left alone together, the fox will eat the goose or the goose will eat the beans. How does the farmer get everything across the river safely?

Here is the answer:

What a classic problem! Let's break it down step by step:

1. The farmer takes the goose across the river first.
2. He leaves the goose on the other side and goes back to the original side to take the fox across the river.
3. He leaves the fox on the other side but takes the goose back with him to the original side.
4. He leaves the goose on the original side and takes the beans across the river.
5. He leaves the beans on the other side with the fox and goes back to the original side to get the goose.
6. He takes the goose across the river for the third time, and now all three - the goose, the fox, and the beans - are safely on the other side of the river.

This solution ensures that at no point are the goose and the fox left alone together (or the goose and the beans), which prevents any harm from happening.

To summarize:
1. Goose + Boat
2. Fox + Boat
3. Goose + Boat
4. Beans + Boat
5. Goose + Boat
6. Goose + Boat (final step)

Hope that helps!
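
The six numbered steps in the answer can be checked mechanically. The sketch below is my own small simulator, not part of the model's output: it expands the steps into individual river crossings (including the two empty return trips) and verifies after every crossing that no unsafe pair is left on a bank without the farmer.

```python
# Pairs that must never share a bank while the farmer is on the other one.
UNSAFE_PAIRS = [("fox", "goose"), ("goose", "beans")]

# The answer's six steps as crossings; None means the farmer rows alone.
MODEL_PLAN = ["goose", None, "fox", "goose", "beans", None, "goose"]


def is_valid_plan(crossings) -> bool:
    """Simulate the crossings; 0 is the starting bank, 1 is the far bank."""
    side = {"farmer": 0, "fox": 0, "goose": 0, "beans": 0}
    for cargo in crossings:
        if cargo is not None:
            # The farmer can only carry something from his own bank.
            if side[cargo] != side["farmer"]:
                return False
            side[cargo] = 1 - side[cargo]
        side["farmer"] = 1 - side["farmer"]
        for a, b in UNSAFE_PAIRS:
            if side[a] == side[b] != side["farmer"]:
                return False
    # Success only if everyone ended up on the far bank.
    return all(position == 1 for position in side.values())


print(is_valid_plan(MODEL_PLAN))
print(is_valid_plan(["fox"]))  # leaves the goose alone with the beans
```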

It got it right! Now let's test its coding ability:

Write a bubble sort in C++

Here is the answer:

Here is a simple implementation of the bubble sort algorithm in C++:

#include <iostream>
using namespace std;

void bubbleSort(int arr[], int n) {
   for (int i = 0; i < n - 1; i++) {
       for (int j = 0; j < n - i - 1; j++) {
           if (arr[j] > arr[j + 1]) {
               // swap arr [j] and arr [j+1]
               int temp = arr[j];
               arr[j] = arr[j + 1];
               arr[j + 1] = temp;
           }
       }
   }
}

void printArray(int arr[], int size) {
   for (int i = 0; i < size; i++) {
       cout << arr[i] << " ";
   }
   cout << endl;
}

int main() {
   int arr[] = { 64, 34, 25, 12, 22, 11, 90 };
   int n = sizeof(arr) / sizeof(arr[0]);

   bubbleSort(arr, n);

   cout << "Sorted array: \n";
   printArray(arr, n);

   return 0;
}

Here’s a step-by-step explanation of the code:

The bubbleSort function takes an array arr and its size n as parameters. It iterates through the array using two nested loops.
The outer loop iterates from the first element to the second-to-last element (i < n - 1).
The inner loop iterates from the first element to the n - i - 1 element. This is because, after each iteration of the outer loop, the largest element in the unsorted portion of the array will “bubble up” to its correct position, so we don’t need to compare it again.
Inside the inner loop, we compare adjacent elements (arr [j] and arr [j + 1]). If arr [j] is greater than arr [j + 1], we swap them using a temporary variable temp.
After the inner loop completes, the largest element in the unsorted portion of the array will be in its correct position.
We repeat this process until the entire array is sorted.
The printArray function is used to print the sorted array.
In the main function, we define an array arr with some unsorted elements, calculate its size n, and then call the bubbleSort function to sort the array.
Finally, we print the sorted array using the printArray function.

Please note that bubble sort is not the most efficient sorting algorithm for large datasets, but it’s a good starting point for understanding basic sorting concepts.

Let's try some other conversations:

User: meow meow

Ah, a request for a meow! As a humanoid AI assistant, I'm not a cat, but I can certainly try to mimic one. Here's my best attempt:

"Meow!"

Now, since I'm an AI, I'm not actually meowing. But I'm happy to pretend! Do you have any other questions or topics you'd like to discuss? I'm here to help, as always.

User: Do you think machine can think?

The question of whether machines can think is a complex one, and it's been debated by philosophers, scientists, and thinkers for centuries. Let's break it down step by step.

First, we need to define what we mean by "thinking." Human thinking is a multi-faceted process that involves consciousness, creativity, emotions, and the ability to make decisions based on our experiences, knowledge, and understanding of the world. It's a deeply subjective and personal experience.

When we talk about machines, we're typically referring to computers, robots, or other electronic devices that process information, execute commands, and perform tasks based on pre-programmed algorithms. Machines can process vast amounts of data, make predictions, and perform complex calculations, but they don't have consciousness or emotions.

So, can machines think? From a human perspective, the answer is no. Machines don't have subjective experiences, emotions, or a sense of self-awareness. They don't have the ability to understand the world in the same way humans do, and they don't have the capacity for creativity or imagination.

However, some argue that machines can be considered "thinking" in a functional sense. For example, a machine learning algorithm can analyze data, identify patterns, and make predictions based on that data. It can "learn" from new information and adapt its behavior accordingly. In this sense, machines can be said to think, but only in the sense that they're processing information and making decisions based on that information.

Ultimately, the question of whether machines can think is a matter of perspective. From a human perspective, machines don't have the capacity for consciousness, emotions, or subjective experiences, so they can't truly think in the same way humans do. However, from a functional perspective, machines can perform tasks and make decisions based on data, which can be considered a form of thinking.

As our technology continues to evolve, we may develop machines that are more advanced and sophisticated, blurring the line between human and machine thinking. But for now, we can say that machines can perform tasks and make decisions, but they don't possess the subjective experiences, emotions, and consciousness that characterize human thinking.

Upload

huggingface-cli upload raincandy-u/phillama-3.8b-v1 .

The result is stored here: https://huggingface.co/raincandy-u/phillama-3.8b-v1

References

Not really references, just some pages I looked up along the way.

[1] Memory-efficient LLM Training with GaLore
[2] Efficient finetuning of Llama 3 with FSDP QDoRA