Bag-of-Words

*All English-language images used in this post come from the cs231n lecture materials.*

<Bag-of-Words>

1. Word Embedding

2. Bag-of-Words representation

3. Naive Bayes Classifier

Word Embedding

< What the vector captures >

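	Bag-of-Words assumption	Language model	Distributional hypothesis
Content	Which words are used often (frequency)	In what order words are used, ex) "I ___ to school" (eat, sleep, go)	Which words are used together, ex) (school, study) (car, drive)
Representative statistic	TF-IDF	-	PMI
Representative model	Deep Averaging Network	ELMo, GPT	Word2Vec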

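Bag-of-Words assumption

The assumption that the author's intent shows up in which words are used and how often they are used

TF-IDF

The stronger a word's ability to predict the topic, the higher its weight

Deep Averaging Network: builds a sentence embedding by averaging the embeddings of the words in the sentence

→ Word order is ignored; only which words appear is taken into account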
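The averaging step can be sketched in a few lines. This is only a minimal illustration of the averaging itself, not a full Deep Averaging Network (which also passes the average through feed-forward layers); the function name `average_embedding` and the 2-dimensional `toy_embeddings` are made up for this example.

```python
def average_embedding(sentence, embeddings):
    """Average the embedding vectors of the known words in a sentence."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    # Element-wise mean over all word vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "i": [1.0, 0.0],
    "like": [0.0, 1.0],
    "school": [2.0, 2.0],
}

a = average_embedding("i like school", toy_embeddings)
b = average_embedding("school like i", toy_embeddings)
print(a)       # sentence embedding
print(a == b)  # shuffling the words gives the same vector
```

Because the mean does not depend on the order of its inputs, both sentences map to the same vector, which is exactly the sense in which word order is ignored.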

Language model

Learns the order in which words appear and assigns a probability to how natural a given word sequence is (DL model)
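To make "assigning a probability to a word sequence" concrete, here is a rough sketch that swaps the deep-learning model for a much simpler count-based bigram model. It is not how ELMo or GPT work; the corpus, the `<s>`/`</s>` boundary markers, and the function names are all assumptions made for illustration.

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sequence_probability(sentence, unigrams, bigrams):
    """P(sentence) as a product of P(current word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen context: no estimate available
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

# Hypothetical toy corpus.
corpus = ["i go to school", "i go to work", "you go to school"]
unigrams, bigrams = train_bigram(corpus)

natural = sequence_probability("i go to school", unigrams, bigrams)
shuffled = sequence_probability("school to go i", unigrams, bigrams)
print(natural, shuffled)
```

The sentence in a familiar order receives a higher probability than the same words shuffled, which is the kind of order sensitivity a language model provides and a bag-of-words model lacks.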

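Distributional hypothesis

The assumption that the meaning of a word can be inferred from its surrounding context
PMI: quantifies how often two words (A, B) appear together, relative to how often they would co-occur if they were independent

Word2Vec: encodes the context around a given target word, i.e. its distributional information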
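PMI, the representative statistic for this hypothesis, is commonly defined as PMI(A, B) = log( P(A, B) / (P(A)·P(B)) ). The sketch below estimates those probabilities by counting in how many contexts each word appears; treating each context as a set of words, and the toy `contexts` themselves, are assumptions for illustration.

```python
import math

def pmi(word_a, word_b, contexts):
    """PMI(A, B) = log( P(A, B) / (P(A) * P(B)) ), estimated from counts."""
    n = len(contexts)
    count_a = sum(1 for c in contexts if word_a in c)
    count_b = sum(1 for c in contexts if word_b in c)
    count_ab = sum(1 for c in contexts if word_a in c and word_b in c)
    if count_ab == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log((count_ab / n) / ((count_a / n) * (count_b / n)))

# Hypothetical contexts (e.g. sentences reduced to sets of words).
contexts = [
    {"school", "study"},
    {"school", "study", "teacher"},
    {"car", "drive"},
    {"car", "drive", "road"},
]

print(pmi("school", "study", contexts))  # positive: they co-occur often
print(pmi("school", "drive", contexts))  # -inf: they never co-occur
```

A positive value means the pair appears together more often than independence would predict, which matches the (school, study) and (car, drive) examples in the table.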

Bag-of-Words representation

Bag-of-Words(BoW)

A text is represented as a bag of its words

Modeling BoW

Build a vocabulary from the unique words
Encode each word as a one-hot vector
Represent the text as the sum of its one-hot vectors
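The three steps above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization; the function names and the two example sentences are hypothetical.

```python
def build_vocabulary(texts):
    """Step 1: map each unique word to an index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Step 2: encode a word as a one-hot vector over the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(text, vocab):
    """Step 3: sum the one-hot vectors of all words in the text."""
    bow = [0] * len(vocab)
    for word in text.split():
        for i, v in enumerate(one_hot(word, vocab)):
            bow[i] += v
    return bow

# Hypothetical example texts.
texts = ["john likes movies", "mary likes movies and likes john"]
vocab = build_vocabulary(texts)
print(vocab)
print(bag_of_words(texts[1], vocab))
```

Each position of the resulting vector is simply the count of the corresponding vocabulary word in the text, so "likes" contributes 2 for the second sentence.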

TF-IDF

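Based on the Bag-of-Words assumption
Quantifies how important a word is across the whole set of documents

Let A be the word-document matrix (W x D). Then,

- A[w(i), d(j)]: number of times the i-th word appears in the j-th document

- TF(w): number of times w appears across all documents

- DF(w): number of documents in which w appears

- TF-IDF(w) = TF(w) x log(D / DF(w)), where D is the number of documents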

ex) (a corpus of D = 3 documents)

TF('is') = 3, TF('good') = 1

DF('is') = 3, DF('good') = 1

TF-IDF('is') = TF('is') x log(3/DF('is')) = 3 x log(3/3) = 0

TF-IDF('good') = 1 x log(3/1) = log 3

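→ 'good' has a higher TF-IDF value than 'is' (log 3 > 0), so it carries more important information about the documents. 'is' appears in every document, so its weight drops to 0.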
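The arithmetic in this example can be reproduced with a short sketch. It follows the definitions used in this post (TF counted over the whole corpus, natural logarithm), which differ from the per-document TF used by many libraries; the three documents are hypothetical, chosen only so that TF('is') = 3, DF('is') = 3, TF('good') = 1 and DF('good') = 1.

```python
import math
from collections import Counter

def tf_idf_scores(documents):
    """TF-IDF(w) = TF(w) * log(D / DF(w)), with TF counted over all documents."""
    n_docs = len(documents)
    tokenized = [doc.split() for doc in documents]
    # TF: total occurrences of each word across the corpus.
    tf = Counter(w for tokens in tokenized for w in tokens)
    # DF: number of documents that contain each word at least once.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

# Hypothetical corpus of three documents.
documents = ["this is good", "that is bad", "it is raining"]
scores = tf_idf_scores(documents)
print(scores["is"])    # 3 * log(3/3)
print(scores["good"])  # 1 * log(3/1)
```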

Naive Bayes Classifier

Bayes' theorem

Assume that all features are independent of one another (given the class)

Naive Bayes Classifier

A simple probabilistic classifier that applies Bayes' theorem
All features are independent → all words in the sequence are independent
Assume each document d is assigned one of C classes

- P(c|d): probability that d belongs to c

- Dropping the denominator of Bayes' theorem (P(d) is the same for every class) → P(c|d) ∝ P(d|c)·P(c)

- Since d can be viewed as a sequence of words w1, w2, ···, wn,

→ P(d|c)·P(c) = P(w1, w2, ···, wn|c)·P(c)

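Applying the chain rule together with the independence assumption:

→ P(w1, w2, ···, wn|c)·P(c) = P(c)·P(w1|c)·P(w2|c)···P(wn|c)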

ex) Decide whether the new sentence 'you free lottery' is spam or inbox:

P(Cspam|sentence) ∝ P(Cspam)·P(Wyou|Cspam)·P(Wfree|Cspam)·P(Wlottery|Cspam) = 6/1000

P(Cinbox|sentence) ∝ P(Cinbox)·P(Wyou|Cinbox)·P(Wfree|Cinbox)·P(Wlottery|Cinbox) = 0

→ The sentence is classified as spam
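The same decision rule can be sketched in code. This is a minimal count-based illustration without smoothing, so a word never seen in a class drives that class's score to 0, as in the inbox case above. The training messages are hypothetical and do not reproduce the 6/1000 figure; the function names are also made up for this example.

```python
from collections import Counter

def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return class_counts, word_counts

def score(sentence, label, class_counts, word_counts):
    """Unnormalized P(c) * product of P(w_i | c), with no smoothing."""
    total_docs = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    result = class_counts[label] / total_docs  # prior P(c)
    for word in sentence.split():
        result *= word_counts[label][word] / total_words  # P(w_i | c)
    return result

def classify(sentence, class_counts, word_counts):
    """Pick the class with the highest unnormalized score."""
    return max(class_counts,
               key=lambda c: score(sentence, c, class_counts, word_counts))

# Hypothetical training data.
training = [
    ("free lottery for you", "spam"),
    ("win free money", "spam"),
    ("are you free today", "inbox"),
    ("see you at lunch", "inbox"),
]
class_counts, word_counts = train(training)

sentence = "you free lottery"
print(score(sentence, "spam", class_counts, word_counts))
print(score(sentence, "inbox", class_counts, word_counts))
print(classify(sentence, class_counts, word_counts))
```

Since only the ranking of the classes matters, the scores are left unnormalized, mirroring the step where the denominator of Bayes' theorem is dropped.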

