Aliens School
Cinematic Knowledge Experience

🔤 Topic 32: Tokenizer Deep Dive - BPE, WordPiece, Unigram

Course: NLP Advanced | Section 4: Pre-trained Language Models | Topic: Tokenizer Algorithms - BPE,…

Topic 1

📌 Overview: Why Is Tokenization Critical?

Tokenization = breaking text into model-ready pieces. Why it matters:…
Topic 2

🔷 Algorithm 1: BPE (Byte-Pair Encoding)

BPE: the most widely used subword algorithm. Used by: GPT-2, GPT-3, GPT-4, RoBERTa, LLaMA…
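
To make the merge loop concrete, here is a minimal sketch of BPE training on a toy word-frequency table. The corpus, the function names, and the tie-breaking rule (first pair seen wins) are assumptions for illustration; production tokenizers typically also add end-of-word or byte-level handling, which is left out here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Start from characters and repeatedly merge the most frequent pair."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: the first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical toy corpus: word -> frequency
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(toy_corpus, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges, "newest" is segmented as n, e, w, est: the frequent suffix has become a single token while rarer prefixes stay split.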
Topic 3

🟢 Algorithm 2: WordPiece

Unknown word → greedily match…

Continuation marked with "##"…
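
The two points above (greedy matching and the "##" continuation marker) can be sketched as a longest-match-first tokenizer for a single word. The vocabulary and the function name are made up for illustration, and falling back to `[UNK]` for the whole word when no piece matches is an assumption modelled on BERT-style tokenizers.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"un", "##aff", "##able", "##a", "##ble", "[UNK]"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```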

Topic 4

🟡 Algorithm 3: Unigram Language Model

Unigram: the opposite approach to BPE. Used by: T5, ALBERT, XLNet (via SentencePiece). BPE:…
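
A unigram tokenizer scores a segmentation as the product of its token probabilities and picks the best one, usually with Viterbi search. The sketch below assumes a hand-written probability table and a hypothetical helper name; real training also re-estimates the probabilities after each pruning round, which is omitted here.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-probability segmentation of `text` and its log score."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n   # best log-prob of text[:i]
    best_start = [0] * (n + 1)             # where the last piece of text[:i] begins
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        return None, -math.inf             # text cannot be covered by the vocabulary
    pieces, end = [], n
    while end > 0:                         # walk the back-pointers
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]

# Hypothetical token probabilities (they sum to 1.0)
toy_probs = {"h": 0.1, "u": 0.1, "g": 0.1, "s": 0.1, "hu": 0.15, "ug": 0.2, "hug": 0.25}
log_probs = {tok: math.log(p) for tok, p in toy_probs.items()}

pieces, score = viterbi_segment("hugs", log_probs)
print(pieces)  # ['hug', 's']

# Pruning view: drop "hug" and see how the best segmentation changes.
pruned = {tok: lp for tok, lp in log_probs.items() if tok != "hug"}
pruned_pieces, pruned_score = viterbi_segment("hugs", pruned)
print(pruned_pieces)  # ['h', 'ug', 's']
```

The drop from `score` to `pruned_score` is the kind of likelihood loss a unigram trainer uses to decide which tokens are cheap to remove from a large starting vocabulary.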
Topic 5

🟣 SentencePiece: Language-Agnostic Framework

SentencePiece: pre-tokenization free. Used by: T5, ALBERT, XLNet, mT5, LLaMA. Problem…
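
"Pre-tokenization free" means the raw text is treated as one symbol stream, with spaces escaped as the meta symbol "▁" (U+2581) so that decoding can restore them exactly. The snippet below is a pure-Python sketch of that escaping idea only; it does not call the `sentencepiece` library, the helper names are hypothetical, and the real library applies normalization that this sketch skips.

```python
META = "\u2581"  # the "▁" meta symbol that stands in for a space

def escape_whitespace(text):
    """Turn raw text into a single symbol stream with spaces made visible."""
    return META + text.replace(" ", META)

def detokenize(pieces):
    """Invert the escaping: join pieces, restore spaces, drop the prefix space."""
    return "".join(pieces).replace(META, " ")[1:]

stream = escape_whitespace("Hello world")
print(stream)  # ▁Hello▁world

# Any split of the stream into pieces decodes back to the original text.
pieces = [META + "Hello", META + "wor", "ld"]
restored = detokenize(pieces)
print(restored)  # Hello world
```

Because the space is just another symbol, the same procedure works for languages that do not separate words with whitespace.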
Topic 6

📊 Algorithm Comparison

Comparison table (truncated in the source); the header row begins: | | BPE | …
Topic 7

💻 Complete Python Implementation

BPE: merge most FREQUENT pair

WordPiece: merge pair with highest…

BPE: start small, add merges…

Unigram: start big, remove tokens…
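
The difference between the two merge criteria can be seen on one toy table. The WordPiece line above is cut off in the source, so the score used here, pair count divided by the product of the two symbol counts, is an assumption based on how WordPiece training is commonly described; the corpus and function names are also hypothetical, and "##" prefixes are left out to keep the sketch short.

```python
from collections import Counter

def pair_and_symbol_counts(words):
    """Frequency-weighted counts of adjacent pairs and of single symbols."""
    pair_counts, symbol_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts, symbol_counts

def best_by_frequency(pair_counts):
    """BPE-style choice: the most frequent adjacent pair."""
    return max(pair_counts, key=pair_counts.get)

def best_by_score(pair_counts, symbol_counts):
    """WordPiece-style choice (assumed score): count(ab) / (count(a) * count(b))."""
    def score(pair):
        a, b = pair
        return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])
    return max(pair_counts, key=score)

# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
toy_words = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
             tuple("bun"): 4, tuple("hugs"): 5}
pair_counts, symbol_counts = pair_and_symbol_counts(toy_words)

print(best_by_frequency(pair_counts))             # ('u', 'g')
print(best_by_score(pair_counts, symbol_counts))  # ('g', 's')
```

The frequency rule picks the common pair (u, g), while the score rule prefers (g, s) because "s" almost never appears without a preceding "g" in this corpus.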

Topic 8

❓ Quiz: Tokenizer Mastery

A) BPE is bottom-up, WordPiece…

B) BPE merges by frequency…

C) BPE is slower

D) WordPiece has no [UNK]

Topic 9

🧭 Navigation

| Previous | Index | Next |
|----------|-------|------|
| 31-Fine-Tuning.md | 00-Index.md…
Quick Quiz

Quiz: Question 1

What is the most accurate definition of Topic 32: Tokenizer Deep Dive - BPE, WordPiece, Unigram?
