Unsloth Studio troubleshooting notes
Environment: offline, official Docker image, running on Kubernetes (k8s)
1. Model export tries to clone llama.cpp from GitHub
Manually download the source from https://github.com/ggml-org/llama.cpp and copy it into the Unsloth pod.
Create the directory from the path shown in the error log and put the llama.cpp source inside it:
mkdir -p /home/unsloth/.unsloth/llama.cpp
/opt/venv/bin/python3 -m pip install gguf protobuf sentencepiece mistral_common  # install against your own index (or set the env var and retry) so it stops pulling from GitHub automatically; remember to configure the pip mirror, or intervene by hand; ps -ef shows which packages it is installing
After that, the llama.cpp source build and install completes.
Manual conversion, without depending on Unsloth:
# Convert and emit an f16 GGUF
cd llama.cpp
python convert_hf_to_gguf.py ./my_unsloth_model \
    --outtype f16 \
    --outfile my_model_f16.gguf
Further on, the code also hardcodes a download from GitHub: https://github.com/ggerganov/llama.cpp/raw/refs/heads/master/convert_hf_to_gguf.py
I didn't test changing the environment variable; I patched the code directly.
unsloth@unsloth-studio-7fd9b89dcd-mjd8x:/workspace/llama.cpp$ python -m http.server 8081
Serving HTTP on 0.0.0.0 port 8081 (http://0.0.0.0:8081/) ...
127.0.0.1 - - [26/Apr/2026 13:40:37] "GET /convert_hf_to_gguf.py HTTP/1.1" 200 -
127.0.0.1 - - [26/Apr/2026 13:42:35] "GET /convert_hf_to_gguf.py HTTP/1.1" 200 -
vim /opt/venv/lib/python3.12/site-packages/unsloth_zoo/llama_cpp.py  # the failing code is at line 55
With that, the GGUF conversion and quantization succeeded.
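The patch itself amounts to swapping the hardcoded GitHub raw URL for the local http.server address. A minimal sketch of the substitution; the quoted line below is reconstructed for illustration, not a verbatim copy of the module, and the exact line varies between unsloth_zoo versions, so check the file first:

```shell
# Roughly what the offending line in llama_cpp.py looks like (reconstructed):
echo 'url = "https://github.com/ggerganov/llama.cpp/raw/refs/heads/master/convert_hf_to_gguf.py"' > /tmp/demo_line.py

# Point it at the local http.server instead of GitHub; run the same sed against
# /opt/venv/lib/python3.12/site-packages/unsloth_zoo/llama_cpp.py inside the pod
sed -i 's|https://github.com/ggerganov/llama.cpp/raw/refs/heads/master|http://127.0.0.1:8081|g' /tmp/demo_line.py

cat /tmp/demo_line.py
```

Editing in vim as above achieves the same thing; sed just makes the change reproducible when you rebuild the image later.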
2. Dataset generation errors
index-CY5egRSv.js:73 POST http://10.103.184.147:8000/api/data-recipe/validate 500 (Internal Server Error)
The persisted workspace directory must be writable by the unsloth user.
You also have to select Full Run.
Per Unsloth Studio's design:
Preview Run: for quick debugging only; it does not write a persistent local dataset file, so nothing appears in the Train page's dataset list.
Full Run: only this produces a persistent dataset file the training page can recognize; it then shows up automatically under the Local tab.
In the third screenshot every run record is marked Preview, meaning only previews were executed and the full dataset generation flow never ran, so the Train page naturally finds no dataset.
3. Starting training
Expected scale: a 30,000-sample dataset.
Training error 1:
Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/workspace/work/heretic8-glm4.7-flash.Q4_K_M.gguf'. Use repo_type argument if needed.
GGUF is not supported as a training input, which is why the previous post ran the abliteration as a separate step.
Training error 2:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set llm_int8_enable_fp32_cpu_offload=True and pass a custom device_map to from_pretrained. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
This is simply insufficient VRAM: the 56 GB model would not start yesterday on a 1/2 MIG slice, so today it moved to an 80 GB A800.
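For reference, the offload route the error message points at looks roughly like the configuration sketch below. This is not what was done here (the fix was a bigger GPU); the model id is a placeholder, and offloaded modules run much slower than on-GPU ones:

```
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Keep the modules that don't fit on the GPU in fp32 on the CPU
bnb = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "my/model",                # placeholder model id
    quantization_config=bnb,
    device_map="auto",         # or a custom {module: device} mapping
)
```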
After adjusting configuration inside the offline Unsloth pod, it's worth rebuilding the image so that llama.cpp, the pip mirror, and so on don't have to be set up again every time.
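One way to bake those fixes in is a small derived image. A sketch; the base tag, the pre-downloaded llama.cpp directory, and the pip.conf are all placeholders for whatever your registry and mirror actually use:

```dockerfile
# Hypothetical derived image carrying the offline fixes
FROM registry.internal/unsloth-studio:base

# Pre-seeded llama.cpp source, so export never reaches for GitHub
COPY llama.cpp /home/unsloth/.unsloth/llama.cpp

# Internal pip mirror configuration
COPY pip.conf /etc/pip.conf

# Conversion dependencies preinstalled
RUN /opt/venv/bin/python3 -m pip install gguf protobuf sentencepiece mistral_common
```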
Paths that need persistence:
volumeMounts:
  - mountPath: /workspace
    name: dsdata
    subPath: unslothdata/workspace
  - mountPath: /home/unsloth/.unsloth/studio
    name: dsdata
    subPath: unslothdata/studio
  - mountPath: /data/GLM
    name: dsdata
    subPath: GLM-4.7-Flash

4. Merging datasets: yesterday each book got its own training set, plus one ten-book mixture; merge all of them so training runs once over everything.
1. Pull all the dataset files out, then merge them with your own Python script (or let an AI write one).
2. Put the merged dataset back into the Unsloth dataset directory (it then shows up when you select Local), or upload it manually on the training page.
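The merge itself can be a few lines of Python. A sketch, assuming the generated datasets are JSONL files with one sample per line; that is a guess at the on-disk format, so adjust if Studio writes something else:

```python
import json
from pathlib import Path

def merge_jsonl(inputs, output):
    """Concatenate several JSONL dataset files into one, skipping blank lines."""
    count = 0
    with open(output, 'w', encoding='utf-8') as out:
        for path in inputs:
            for line in Path(path).read_text(encoding='utf-8').splitlines():
                if line.strip():
                    # Round-trip through json so corrupt lines fail fast
                    out.write(json.dumps(json.loads(line), ensure_ascii=False) + '\n')
                    count += 1
    return count
```

For example, merge_jsonl(sorted(Path('datasets').glob('*.jsonl')), 'merged.jsonl') produces one file ready to drop back into the dataset directory.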
Training parameter configuration
This Windows box runs tiny11-core and is missing the Chinese language pack (or something along those lines), so a lot of the novel files have broken encodings on this machine. A local Claude wrote a conversion script:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Universal Book Encoding Fixer v2.0

Usage:
    # Fix a single file
    python fix_encoding.py "book.txt"
    # Fix a single file with an explicit output name
    python fix_encoding.py "input.txt" "output.txt"
    # Fix every txt file in a directory (recurses into subdirectories by default)
    python fix_encoding.py --dir "./books"
    # Fix several formats
    python fix_encoding.py --dir "./books" --ext .txt .novel .umd
    # Overwrite the original files (use with caution)
    python fix_encoding.py --dir "./books" --overwrite
    # Show help
    python fix_encoding.py --help
"""
import os
import sys
import argparse
import re
from pathlib import Path
from datetime import datetime

# Force UTF-8 for console output on Windows
if sys.platform == 'win32':
    try:
        sys.stdout.reconfigure(encoding='utf-8')
        sys.stderr.reconfigure(encoding='utf-8')
    except Exception:
        pass


class EncodingFixer:
    """Book encoding fixer"""

    def __init__(self):
        self.gbk_misread_map = self._build_gbk_map()
        self.stats = {'total': 0, 'success': 0, 'failed': 0, 'skipped': 0, 'encodings': {}}

    def _build_gbk_map(self):
        """Build GBK misread character mapping"""
        mapping = {}
        for b1 in range(0x81, 0xFE):
            for b2 in range(0x40, 0xFE):
                if b2 == 0x7F:
                    continue
                gbk_byte = bytes([b1, b2])
                try:
                    correct_char = gbk_byte.decode('gbk')
                    misread_chars = chr(b1) + chr(b2)
                    mapping[misread_chars] = correct_char
                except Exception:
                    continue
        return mapping

    def read_file(self, filepath):
        """Smart file read: try several encodings"""
        encodings = ['utf-8', 'gbk', 'gb2312', 'big5', 'utf-16-le', 'utf-16-be']
        with open(filepath, 'rb') as f:
            raw_bytes = f.read()
        # Check BOM markers
        if raw_bytes.startswith(b'\xef\xbb\xbf'):
            return raw_bytes[3:].decode('utf-8'), 'utf-8-sig'
        elif raw_bytes.startswith(b'\xff\xfe'):
            return raw_bytes[2:].decode('utf-16-le'), 'utf-16-le'
        elif raw_bytes.startswith(b'\xfe\xff'):
            return raw_bytes[2:].decode('utf-16-be'), 'utf-16-be'
        elif raw_bytes.startswith(b'\x2b\x2f\x76'):
            return raw_bytes[3:].decode('utf-7'), 'utf-7'
        # Try encodings in order; accept if under 5% replacement chars
        for enc in encodings:
            try:
                content = raw_bytes.decode(enc)
                if content.count('\ufffd') < len(content) * 0.05:
                    return content, enc
            except Exception:
                continue
        return raw_bytes.decode('gbk', errors='replace'), 'gbk-replace'

    def fix_mojibake(self, text):
        """Repair mojibake via the GBK misread map"""
        text = text.replace('\ufffd', '')
        result = []
        i = 0
        while i < len(text):
            if i + 1 < len(text):
                pair = text[i:i + 2]
                if pair in self.gbk_misread_map:
                    result.append(self.gbk_misread_map[pair])
                    i += 2
                    continue
            result.append(text[i])
            i += 1
        return ''.join(result)

    def clean_text(self, text):
        """Strip zero-width characters and trailing whitespace"""
        text = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', text)
        text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
        return text

    def process(self, input_path, output_path=None, overwrite=False):
        """Process a single file"""
        input_path = Path(input_path)
        if not input_path.exists():
            raise FileNotFoundError(f"File not found: {input_path}")
        # Read
        content, detected_encoding = self.read_file(input_path)
        original_len = len(content)
        # Clean
        content = self.clean_text(content)
        # Fix mojibake
        content = self.fix_mojibake(content)
        # Determine output path (suffix already contains the dot)
        if output_path is None:
            if overwrite:
                output_path = input_path
            else:
                output_path = input_path.parent / (input_path.stem + '_fixed' + input_path.suffix)
        else:
            output_path = Path(output_path)
        # Check if output already exists
        if output_path.exists() and not overwrite:
            return {'input': str(input_path), 'output': str(output_path),
                    'status': 'skipped', 'reason': 'output exists'}
        # Write with UTF-8
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(content)
        return {'input': str(input_path), 'output': str(output_path),
                'encoding': detected_encoding, 'original_len': original_len,
                'fixed_len': len(content), 'status': 'success'}

    def process_directory(self, dir_path, extensions, recursive=True, overwrite=False):
        """Process every matching file in a directory"""
        dir_path = Path(dir_path)
        if not dir_path.exists():
            raise FileNotFoundError(f"Directory not found: {dir_path}")
        pattern = '**/*' if recursive else '*'
        files = sorted([f for f in dir_path.glob(pattern)
                        if f.is_file() and f.suffix.lower() in extensions])
        self.stats['total'] = len(files)
        total_chars = 0
        print(f"Found {len(files)} files to process")
        print(f"Recursive: {recursive}, Overwrite: {overwrite}")
        print("-" * 60)
        for idx, filepath in enumerate(files, 1):
            rel_path = filepath.relative_to(dir_path)
            print(f"[{idx}/{len(files)}] {rel_path}", end=" ")
            try:
                result = self.process(filepath, overwrite=overwrite)
                if result['status'] == 'skipped':
                    self.stats['skipped'] += 1
                    print("[SKIPPED - output exists]")
                else:
                    self.stats['success'] += 1
                    total_chars += result['fixed_len']
                    enc = result.get('encoding', 'unknown')
                    self.stats['encodings'][enc] = self.stats['encodings'].get(enc, 0) + 1
                    print(f"[OK] {enc} -> UTF-8 ({result['fixed_len']} chars)")
            except Exception as e:
                self.stats['failed'] += 1
                print(f"[ERROR] {e}")
        return {'stats': self.stats, 'total_chars': total_chars}


def print_summary(summary):
    """Print the statistics summary"""
    print("\n" + "=" * 60)
    print("PROCESSING SUMMARY")
    print("=" * 60)
    print(f"Total files: {summary['stats']['total']}")
    print(f"Success: {summary['stats']['success']}")
    print(f"Skipped: {summary['stats']['skipped']}")
    print(f"Failed: {summary['stats']['failed']}")
    print(f"Total chars: {summary['total_chars']:,}")
    if summary['stats']['encodings']:
        print("\nDetected encodings:")
        for enc, count in sorted(summary['stats']['encodings'].items(), key=lambda x: -x[1]):
            print(f"  {enc}: {count}")
    print("=" * 60)


def main():
    parser = argparse.ArgumentParser(
        description='Book Encoding Fixer v2.0 - Fix Chinese book encoding issues',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    # Fix single file
    python fix_encoding.py book.txt
    # Fix with custom output name
    python fix_encoding.py input.txt output.txt
    # Fix all files in directory (recursive by default)
    python fix_encoding.py --dir ./books
    # Fix multiple formats
    python fix_encoding.py --dir ./books --ext .txt .novel .umd
    # Overwrite original files (use with caution!)
    python fix_encoding.py --dir ./books --overwrite
    # Non-recursive mode
    python fix_encoding.py --dir ./books --no-recursive
""")
    parser.add_argument('input', nargs='?', help='Input file path')
    parser.add_argument('output', nargs='?', help='Output file path (optional)')
    parser.add_argument('--dir', '-d', help='Process all files in directory')
    parser.add_argument('--ext', '-e', nargs='+', default=['.txt'],
                        help='File extensions (default: .txt)')
    parser.add_argument('--overwrite', '-w', action='store_true', help='Overwrite original files')
    parser.add_argument('--recursive', '-r', action='store_true', default=True,
                        help='Recursively process subdirectories (default: True)')
    parser.add_argument('--no-recursive', action='store_true', help='Do not process subdirectories')
    args = parser.parse_args()

    fixer = EncodingFixer()
    # Normalize extensions to lowercase with a leading dot
    extensions = [e.lower() if e.startswith('.') else '.' + e.lower() for e in args.ext]
    # Determine recursive mode
    recursive = not args.no_recursive

    # Process directory
    if args.dir:
        try:
            summary = fixer.process_directory(args.dir, extensions,
                                              recursive=recursive, overwrite=args.overwrite)
            print_summary(summary)
        except Exception as e:
            print(f"Error: {e}")
            sys.exit(1)
    # Process single file
    elif args.input:
        try:
            result = fixer.process(args.input, args.output, overwrite=args.overwrite)
            if result.get('status') == 'skipped':
                print("Skipped: output file already exists")
                print(f"  Output: {result['output']}")
            else:
                print("\nProcessing complete!")
                print(f"  Input: {Path(result['input']).name}")
                print(f"  Output: {Path(result['output']).name}")
                print(f"  Encoding: {result.get('encoding', 'N/A')} -> UTF-8")
                print(f"  Size: {result.get('fixed_len', 0):,} characters")
        except Exception as e:
            print(f"Error: {e}")
            sys.exit(1)
    else:
        parser.print_help()
        sys.exit(1)


if __name__ == '__main__':
    main()

This will take about 21 hours to run; leave it going and wait for the result.
