Tokenization

1. NLP Pipeline

2. Tokenization

NLP Pipeline

Pre-tokenization: remove noise from the data → Tokenization: convert the sequence into a form the program can understand

NLP Pipeline

Data Collection

e.g., a document classification API:

Sentence: dataset classes for written style and colloquial style

(written style: news articles, class 0 / colloquial style: blog posts, class 1)

[sent1, 1] x N (number of sentences) x M (number of blogs)

# sent1 - list of tokens
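
The [sent1, 1] layout above can be sketched in a few lines. This is only an illustration: the sample sentences, the `build_dataset` helper, and the whitespace split standing in for a real tokenizer are all assumptions, not part of the original notes.

```python
# Class ids as described above: written style = 0, colloquial style = 1.
WRITTEN, COLLOQUIAL = 0, 1

def build_dataset(documents, label):
    """Turn documents (each a list of sentences) into [tokens, label] pairs."""
    dataset = []
    for sentences in documents:
        for sentence in sentences:
            tokens = sentence.split()  # placeholder tokenizer (assumption)
            dataset.append([tokens, label])
    return dataset

# Hypothetical blog posts: M = 2 blogs with a varying number of sentences.
blogs = [
    ["so tired today", "lets go shopping"],
    ["this cafe is great"],
]
blog_dataset = build_dataset(blogs, COLLOQUIAL)
print(len(blog_dataset))  # 3
print(blog_dataset[0])    # [['so', 'tired', 'today'], 1]
```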

Preprocessing

- Pre-tokenization: cleaning, Normalization ···

- Tokenization

Modeling

- Training code

- Test (inference) code

Training

- Training Monitoring

Evaluation

Metrics

- Quantitative evaluation: accuracy, F1 score ···

- Qualitative evaluation (by humans): user study ···
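
As a rough sketch of the quantitative metrics, accuracy and F1 for the binary written/colloquial task could be computed as follows. The label lists are made up, and the function names are illustrative rather than taken from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and predictions (1 = colloquial, 0 = written).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 6 correct
print(f1_score(y_true, y_pred))  # precision = recall = 2/3
```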

Deployment

Server efficiency ↑, shorter response time ···

Performance Monitoring

Tokenization

Model input
Separates a piece of text into smaller units called tokens
The model recognizes a sequence through its tokens

Tokenization	Tokenized Sequence
Raw Text	나랑 쇼핑하자.
CV (consonant/vowel)	ㄴ/ㅏ/ㄹ/ㅏ/ㅇ/*/ㅅ/ㅛ/ㅍ/ㅣ/ㅇ/ㅎ/ㅏ/ㅈ/ㅏ/.
Syllable	나/랑/*/쇼/핑/하/자/. (syllable-level tokens)
Subword	_나랑/_쇼/핑/하/자/.
Word	나랑/쇼핑하자/.
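
The syllable and consonant/vowel rows can be produced with plain Unicode arithmetic, since each precomposed Hangul syllable encodes its initial, medial, and final jamo by position. The sketch below assumes the `*` convention for spaces used in the table; the function names are illustrative.

```python
# Jamo tables in Unicode order for precomposed Hangul syllables (U+AC00..U+D7A3).
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def syllable_tokenize(text):
    """One token per character; spaces are written as '*'."""
    return ["*" if ch == " " else ch for ch in text]

def jamo_tokenize(text):
    """Split each Hangul syllable into initial, medial, and optional final jamo."""
    tokens = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 11172:  # 19 * 21 * 28 precomposed syllables
            initial, rest = divmod(offset, 21 * 28)
            medial, final = divmod(rest, 28)
            tokens.append(INITIALS[initial])
            tokens.append(MEDIALS[medial])
            if FINALS[final]:
                tokens.append(FINALS[final])
        elif ch == " ":
            tokens.append("*")
        else:
            tokens.append(ch)  # punctuation and other characters pass through
    return tokens

text = "나랑 쇼핑하자."
print("/".join(syllable_tokenize(text)))  # 나/랑/*/쇼/핑/하/자/.
print("/".join(jamo_tokenize(text)))      # ㄴ/ㅏ/ㄹ/ㅏ/ㅇ/*/ㅅ/ㅛ/ㅍ/ㅣ/ㅇ/ㅎ/ㅏ/ㅈ/ㅏ/.
```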

Tokenizer

Word based Tokenizer
Character based Tokenizer: syllables, or initial consonant + medial vowel + final consonant
Subword based Tokenizer

Tokenization	Tokenized Sequence
Raw Text (Input)	['The devil is in the details']
Word-based Tokenizer Output	['The', 'devil', 'is', 'in', 'the', 'details']
Character-based Tokenizer Output	['T', 'h', 'e', 'd', 'e', 'v', 'i', 'l', 'i', 's', 'i', 'n', 't', 'h', 'e', 'd', 'e', 't', 'a', 'i', 'l', 's']
Subword-based Tokenizer Output	['The', 'de', 'vil', 'is', 'in', 'the', 'de', 'tail', 's']
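
The word-based and character-based rows are simple enough to sketch directly. This minimal version assumes whitespace as the only word boundary and drops spaces at the character level, as the table does; real tokenizers also handle punctuation and casing.

```python
def word_tokenize(text):
    """Split on whitespace; one token per word."""
    return text.split()

def char_tokenize(text):
    """One token per character, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]

text = "The devil is in the details"
words = word_tokenize(text)
chars = char_tokenize(text)
print(words)       # ['The', 'devil', 'is', 'in', 'the', 'details']
print(len(words))  # 6
print(len(chars))  # 22
```

The same sentence takes 6 tokens at the word level and 22 at the character level, which is the sequence-length trade-off discussed next.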

Why use Subword-based Tokenizer?

Word-based Tokenizer problem

Out-Of-Vocabulary problem

Train dataset 40 TB, Emb_size = 256 → 40 TB x 256 = 10.24 PB, which still does not cover the true real-world (population) distribution → Out-of-Memory

Neologism (newly coined word) problem

Out-Of-Vocabulary (OOV)

Character-based Tokenizer problem

Long sequence (['T', 'h', 'e', 'd', 'e', 'v', 'i', 'l', 'i', 's', 'i', 'n', 't', 'h', 'e', 'd', 'e', 't', 'a', 'i', 'l', 's'])

Low performance: processing long sequences leads to vanishing gradients, memory problems ···

# if len_input_sequence > 200:

Performance ↓

∴ Use Subword-based Tokenizer

Shorter sequence
Higher performance
(Almost) Free from OOV

e.g., the neologism '알잘딱깔센'

Word-based Tokenizer → treated as unknown data

Subword-based Tokenizer → In vocab: 알(o)/잘(o)/딱(o)/깔(o)/센(x) → less information loss
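
The difference can be sketched with a toy vocabulary. The vocabulary contents, the `[UNK]` marker, and the greedy longest-match segmentation below are assumptions made for illustration; they are not how any specific tokenizer is implemented.

```python
UNK = "[UNK]"

def word_lookup(word, vocab):
    """Word-based lookup: the whole word is either known or unknown."""
    return [word] if word in vocab else [UNK]

def subword_lookup(word, vocab):
    """Greedy longest-match-first segmentation; unmatched characters become [UNK]."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:  # no piece starting at this position is in the vocabulary
            tokens.append(UNK)
            start += 1
    return tokens

vocab = {"알", "잘", "딱", "깔"}  # hypothetical vocabulary without '센'
word = "알잘딱깔센"
print(word_lookup(word, vocab))     # ['[UNK]']
print(subword_lookup(word, vocab))  # ['알', '잘', '딱', '깔', '[UNK]']
```

The word-based lookup loses the entire word, while the subword lookup keeps four of the five pieces.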

Subword-based Algorithms

Byte-pair Encoding (BPE)

Statistical method (e.g., GPT)

WordPiece

- Trained to maximize the likelihood of the pair set (e.g., BERT, DistilBERT, ELECTRA)

E.g., p('ug') / p('u' then 'g')
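
One commonly described form of this ratio scores a pair as freq(pair) / (freq(first) x freq(second)). The sketch below uses that form on a made-up word-frequency table; it works on plain characters and ignores the `##` continuation prefix, so it is a simplification and not the exact procedure used to train BERT's vocabulary.

```python
from collections import Counter

def pair_scores(word_freqs):
    """Score each adjacent symbol pair by freq(pair) / (freq(a) * freq(b))."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for symbol in symbols:
            symbol_freq[symbol] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {
        pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }

# Hypothetical corpus statistics: word -> number of occurrences.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores = pair_scores(word_freqs)
best = max(scores, key=scores.get)
print(best)                # ('g', 's')
print(scores[("u", "g")])  # 20 / (36 * 20) = 1/36
```

Here ('u', 'g') is the most frequent pair (20 occurrences), yet ('g', 's') scores highest because 's' rarely appears without a preceding 'g'. That is the practical difference from a purely frequency-based merge.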

Unigram

- Starts from the pretokenized words and the most common substrings, then trims the vocabulary

SentencePiece

E.g., ALBERT, XLNet, T5

Byte-pair Encoding (BPE)

Statistical method (e.g., GPT)

E.g., aaabdaaabac

1. Z = aa

- The pair aa occurs most often, so replace it with Z (note that Z must be a symbol that does not appear in the original data!)

→ ZabdZabac

2. Y = ab

→ ZYdZYac

3. X = ZY

→ XdXac (X=ZY, Y=ab, Z=aa)

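Repeat until there is no longer a most frequent byte pair left to merge (no pair occurs more than once)

- Performance ↑

- In practice, merging usually continues until the vocab size reaches 50,000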
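
The aaabdaaabac walkthrough can be reproduced with a short script. This is a sketch of the compression form of BPE shown above, not a tokenizer trainer; the `bpe_compress` name, the pool of replacement symbols, and the tie-breaking rule (highest count, then lexicographically largest pair) are assumptions. The tie-break matters in step 2, where Za and ab both occur twice.

```python
from collections import Counter

def bpe_compress(data, new_symbols="ZYXWVU"):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    merges = []
    for symbol in new_symbols:
        if symbol in data:
            raise ValueError(f"{symbol!r} already occurs in the data")
        pair_counts = Counter(a + b for a, b in zip(data, data[1:]))
        if not pair_counts:
            break
        # Highest count wins; ties go to the lexicographically largest pair
        # (an arbitrary choice that happens to pick 'ab' over 'Za').
        pair, count = max(pair_counts.items(), key=lambda item: (item[1], item[0]))
        if count < 2:  # no pair repeats any more, so stop merging
            break
        data = data.replace(pair, symbol)
        merges.append((symbol, pair))
    return data, merges

compressed, merges = bpe_compress("aaabdaaabac")
print(compressed)  # XdXac
print(merges)      # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
```

Tokenizer training applies the same merge loop to a word-frequency table and stops at a target vocabulary size instead of when no pair repeats.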

Image source: Byte-Pair Encoding: Subword-based tokenization | Towards Data Science

Image source: https://wikidocs.net/22592

Data-centric AI ↔ Model-centric AI

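What is high-quality data in NLP?

1) Uniformity of the (X, Y) pairs

2) Minimal information loss when converting X_raw → X_input

→ Hence the need for a Subword-based Tokenizer

☞ The key point is to try changing the Tokenizer method to suit the given dataset

∴ Data preprocessing is extremely important!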

동영`s 인공지능 공부방

Tokenization

NLP Pipeline

NLP Pipeline

Tokenization

Tokenization

Tokenizer

Why use Subword-based Tokenizer?

∴ Use Subword-based Tokenizer

Subword-based Algorithms

Byte-pair Encoding (BPE)

∴ Data preprocessing은 굉장히 중요하다!

'NLP' 카테고리의 다른 글

티스토리툴바

Tokenization

NLP Pipeline

NLP Pipeline

Tokenization

Tokenization

Tokenizer

Why use Subword-based Tokenizer?

∴ Use Subword-based Tokenizer

Subword-based Algorithms

Byte-pair Encoding (BPE)

∴ Data preprocessing은 굉장히 중요하다!

'NLP' 카테고리의 다른 글

'NLP' Related Articles

티스토리툴바