
Trelis Research
18.5K subscribers
4,149 views · 174 likes · Sep 25, 2024
Distillation of Transformer Models
➡️ Get Life-time Access to the Complete Scripts (and future improvements): Trelis.com/ADVANCED-fine-tuning
➡️ One-click fine-tuning and LLM templates: github.com/TrelisResearch/one-click-llms
➡️ Newsletter: blog.Trelis.com/
➡️ Resources/Support/Discord: Trelis.com/About
➡️ Thumbnail made with this tutorial: • Fine Tune Flux Diffusion Models with ...
With credit to Rohan Sharma for his work on these scripts during a Trelis Internship: trelis.com/internships/. Find Rohan on GitHub: github.com/rs545837/
Thanks also to Elie Bakouch of HuggingFace for guidance on using SmolLM corpus: huggingface.co/eliebak
VIDEO RESOURCES:
Slides: docs.google.com/presentation/d/1dQf2CuvmbIpo5Ir35n…
Minitron Distillation Paper: d1qx31qr3h6wln.cloudfront.net/publications/minitro…
Distil-Whisper Paper: arxiv.org/pdf/2311.00430
SmolLM Corpus: huggingface.co/datasets/HuggingFaceTB/smollm-corpu…
Trelis SmolLM 2% split: huggingface.co/datasets/Trelis/smollm-corpus-2perc…
WebInstruct: huggingface.co/datasets/TIGER-Lab/WebInstructSub
TIMESTAMPS:
0:00 AI model distillation (Whisper, Flux, Minitron, gpt-4o-mini?)
0:46 Video Overview - Distillation Tutorial and Code Walk-through
2:00 Distillation Examples (Diffusion - Flux Schnell / Dev, Transcription - Distil-Whisper, LLMs - Nvidia Minitron)
6:51 How distillation works
7:22 Student model initialization
8:36 Layer / depth pruning
11:52 Width pruning
15:25 Pre-training versus distillation
18:40 Cross-entropy loss vs KL-divergence
22:41 Instruction fine-tuning
23:28 Distilling SmolLM 135M to a 99M model
24:43 Code walk-through setup
26:49 Pruning Notebook
28:56 Layer Pruning
31:41 Width Pruning
35:01 Why pruning works
36:17 Distillation Script - Multi-GPU Setup
39:36 Distillation Script Walk-through
54:05 Distillation Configuration File Walk-through
56:32 Distillation Startup and Performance Monitoring with tensorboard
1:03:01 Instruction fine-tuning and dataset selection
1:09:02 Instruction FT Startup and Performance Monitoring with tensorboard
1:12:40 Running inference to evaluate distillation performance
1:12:54 Teacher model performance (base SmolLM 135M)
1:13:53 SmolLM Instruct model performance
1:14:15 Raw pruned model performance (layer pruned) 99M
1:14:38 Width + Layer pruning performance (raw) 99M
1:15:18 Distilled model performance (before instruction tuning) 99M
1:15:57 Instruction tuning performance evaluation
1:16:21 SmolLM 135M Instruct performance
1:17:17 Instruction tuned distilled model performance (99M model)
1:18:33 Final Tips (best pruning approach, learning rate, batch size and model size effects)
1:20:21 Video Resources
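
For reference, here is a minimal, framework-free sketch of the temperature-scaled KL-divergence distillation loss discussed at 18:40. This assumes the common Hinton-style formulation (soften teacher and student logits with a temperature T, take KL divergence, scale by T²); the video's actual training scripts may compute this differently or batch it with a deep-learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over a list of logits,
    # optionally softened by a temperature > 1.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return kl_divergence(t, s) * temperature ** 2

# A student matching the teacher exactly incurs (near) zero loss;
# a mismatched student incurs a positive loss.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 1.0, 0.0], [0.0, 1.0, 3.0]))
```

Unlike the hard cross-entropy loss against one-hot labels, this loss lets the student learn from the teacher's full output distribution, which is the key signal exploited in distillation.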