[Deep Learning from Scratch 2] - 2. Natural Language and Distributed Representations of Words I


해파리냉채무침 2024. 2. 13. 18:33

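Most of Deep Learning from Scratch 1 covered material I had already learned, so I went through the whole book in a week as a review.

Chapter 1 of volume 2 is a recap of volume 1, so I did not write it up. These notes start from Chapter 2.

I plan to take more thorough notes than I did for volume 1.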

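Statistics-based methods

Preprocessing the corpus

Statistics-based methods work with a corpus, which is a large collection of text data.

Everyday writing such as news articles and novels can serve as a corpus.

The following one-sentence corpus is used for the rest of the analysis.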

text = 'You say goodbye and I say hello.'

Split this text into words.

txt = 'You say goodbye and I say hello.'
txt = txt.lower() # lowercase every word
txt = txt.replace('.',' .') # insert a space before the period
words = txt.split(' ') # split on spaces
words

['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']

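The words can also be split with a regular expression.

import re
txt = 'You say goodbye and I say hello.'
words = re.split(r'\W+', txt) # split on runs of non-word characters; no lowercasing, '.' is dropped and a trailing '' remains
print(words)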
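As written, re.split on \W+ discards the period and leaves an empty string at the end of the list, so its result differs from the split(' ') version. One way to get the same tokens from a regular expression is re.findall with a pattern that matches either a run of word characters or a single punctuation mark. This is only a sketch; the helper name tokenize is made up for illustration and is not from the book.

```python
import re

def tokenize(text):
    # hypothetical helper: lowercase the text, then return words and
    # punctuation marks as separate tokens
    return re.findall(r"\w+|[^\w\s]", text.lower())

words = tokenize('You say goodbye and I say hello.')
print(words)
```

With this pattern the period survives as its own token and no empty string is produced.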

To make the words easy to look up, build two dictionaries: one mapping each word to an ID, {'word': id}, and one mapping each ID back to its word, {id: 'word'}.

word_to_id = {}
id_to_word = {}
for word in words: # go through the words in order
    if word not in word_to_id: # first time this word is seen
        new_id = len(word_to_id) # next ID = current dictionary size (IDs start at 0, so 5 words use IDs 0 to 4)
        word_to_id[word] = new_id # assign the new ID to the new word
        id_to_word[new_id] = word
        
print(word_to_id)
print(id_to_word)

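{'You': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'I': 4, 'hello': 5, '': 6}

{0: 'You', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'I', 5: 'hello', 6: ''}

Because words was last assigned by the regular-expression split, the keys keep their original capitalization and the final token is an empty string rather than '.'.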

With these dictionaries you can look up a word by its ID, or an ID by its word.
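A quick sketch of both lookups. It rebuilds the two dictionaries from the lowercased word list so that the block runs on its own.

```python
words = ['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']

word_to_id, id_to_word = {}, {}
for word in words:
    if word not in word_to_id:
        new_id = len(word_to_id)
        word_to_id[word] = new_id
        id_to_word[new_id] = word

print(id_to_word[1])        # ID -> word
print(word_to_id['hello'])  # word -> ID
```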

Next, convert the list of words into a list of word IDs.

import numpy as np
corpus = [word_to_id[w] for w in words] # look up the ID of every word in words
corpus = np.array(corpus) # convert the list to a NumPy array
print(corpus)

[0 1 2 3 4 1 5 6]

The code above is collected into a single function, preprocess().

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])

    return corpus, word_to_id, id_to_word

from util import preprocess
text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
print(corpus)
print(word_to_id)
print(id_to_word)

[0 1 2 3 4 1 5 6]
{'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}
{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
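Since id_to_word is the inverse of word_to_id, the ID array can be decoded back into tokens, which is a handy sanity check on the preprocessing. The sketch below repeats preprocess() so that it runs standalone; the decoding step is my own illustration and is not part of the book's util.py.

```python
import numpy as np

def preprocess(text):
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess('You say goodbye and I say hello.')

# map every ID in the corpus back to its word
decoded = [id_to_word[int(i)] for i in corpus]
print(' '.join(decoded))
```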

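A vector representation that accurately captures the meaning of a word is called a distributed representation.

The distributional hypothesis states that a word has no meaning in itself; its meaning is formed by the context in which it is used.

The window size is the size of the context, that is, the number of neighboring words considered on each side. In the example below, with a window size of 2, the context of goodbye consists of 'you' and 'say' on the left, plus 'and' and 'i' on the right.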

you say goodbye and i say hello.
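The idea of a context window can be sketched in a few lines. The helper name context_window is hypothetical and only serves this illustration: it returns up to window_size words on each side of a target position.

```python
def context_window(words, idx, window_size):
    # hypothetical helper: up to window_size neighbors on each side of words[idx]
    left = words[max(0, idx - window_size):idx]
    right = words[idx + 1:idx + 1 + window_size]
    return left + right

words = ['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']
target = words.index('goodbye')

print(context_window(words, target, 2))
print(context_window(words, target, 1))
```

Near the start or end of the sentence the window is simply cut off, so words at the edges have fewer context words.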

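Co-occurrence matrix

A statistics-based method focuses on one word at a time and counts how many times each other word appears around it.

Here is the co-occurrence matrix for the example sentence above, using a window size of 1.

you say goodbye and i say hello.

Counting the context of the word you: only say appears next to it, so you can be expressed as the vector [0,1,0,0,0,0,0].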

	you	say	goodbye	and	i	hello	.
you	0	1	0	0	0	0	0
say	1	0	1	0	1	1	0
goodbye	0	1	0	1	0	0	0
and	0	0	1	0	1	0	0
i	0	1	0	1	0	0	0
hello	0	1	0	0	0	0	1
.	0	0	0	0	0	1	0

Python implementation

import numpy as np
c = np.array([[0,1,0,0,0,0,0],
              [1,0,1,0,1,1,0],
              [0,1,0,1,0,0,0],
              [0,0,1,0,1,0,0],
              [0,1,0,1,0,0,0],
              [0,1,0,0,0,0,1],
              [0,0,0,0,0,1,0]], dtype=np.int32)


print(c[0]) # vector of the word with ID 0
print(c[4]) # vector of the word with ID 4
print(c[word_to_id['goodbye']]) # vector of 'goodbye'

[0 1 0 0 0 0 0]
[0 1 0 1 0 0 0]
[0 1 0 1 0 0 0]

Co-occurrence matrix implementation - util.py

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus) # corpus length (number of tokens, punctuation included)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32) # vocab_size x vocab_size array of zeros

    for idx, word_id in enumerate(corpus): # walk the corpus by position and word ID
        for i in range(1, window_size + 1):
            left_idx = idx - i # position i steps to the left
            right_idx = idx + i # position i steps to the right

            if left_idx >= 0: # stay inside the left edge of the corpus
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1 # count this (word, neighbor) pair

            if right_idx < corpus_size: # stay inside the right edge of the corpus
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1

    return co_matrix
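As a check, the function can be run on the ID array that preprocess() printed earlier, [0 1 2 3 4 1 5 6], and compared with the matrix written out by hand. The sketch repeats create_co_matrix() so that it is self-contained; the comparison at the end is my own addition.

```python
import numpy as np

def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i

            if left_idx >= 0:
                co_matrix[word_id, corpus[left_idx]] += 1
            if right_idx < corpus_size:
                co_matrix[word_id, corpus[right_idx]] += 1

    return co_matrix

corpus = np.array([0, 1, 2, 3, 4, 1, 5, 6])
C = create_co_matrix(corpus, vocab_size=7)

# the matrix written out by hand for the same sentence
expected = np.array([[0, 1, 0, 0, 0, 0, 0],
                     [1, 0, 1, 0, 1, 1, 0],
                     [0, 1, 0, 1, 0, 0, 0],
                     [0, 0, 1, 0, 1, 0, 0],
                     [0, 1, 0, 1, 0, 0, 0],
                     [0, 1, 0, 0, 0, 0, 1],
                     [0, 0, 0, 0, 0, 1, 0]], dtype=np.int32)

print(C)
print(np.array_equal(C, expected))
```

Because the window looks the same distance to the left and to the right, the resulting matrix is symmetric.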

Similarity between vectors

Cosine similarity is used to measure how similar two word vectors are.

https://github.com/deeplearningfromscratch2/deep-learning-from-scratch-2/blob/master/Ch2_dispersal_nlp/ch2_dispersal_nlp.ipynb

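Given two vectors x = (x1, x2, x3, ..., xn) and y = (y1, y2, y3, ..., yn), the cosine similarity is

$$\mathrm{similarity}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}$$

The numerator is the dot product of the two vectors (x1y1 + x2y2 + x3y3 + ...), and the denominator is the product of their L2 norms. The L2 norm of a vector is computed by squaring each element, summing the squares, and taking the square root (sqrt(x1^2 + x2^2 + ...)).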

def cos_similarity(x, y, eps=1e-8):

    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

The small value eps is added to each norm to avoid division by zero when a vector is all zeros.
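As a worked example, take the co-occurrence rows for you and i from the matrix above, [0,1,0,0,0,0,0] and [0,1,0,1,0,0,0]. Their dot product is 1 and their L2 norms are 1 and sqrt(2), so the similarity should be about 1/sqrt(2). The sketch below also passes an all-zero vector, which is my own test case, to show that eps keeps the result finite.

```python
import numpy as np

def cos_similarity(x, y, eps=1e-8):
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

you = np.array([0, 1, 0, 0, 0, 0, 0])
i = np.array([0, 1, 0, 1, 0, 0, 0])
zero = np.zeros(7)  # all-zero vector, used only to exercise eps

sim_you_i = cos_similarity(you, i)
sim_zero = cos_similarity(zero, you)

print(sim_you_i)  # close to 1 / sqrt(2)
print(sim_zero)   # finite, not nan
```

Without eps, the all-zero vector would be divided by a norm of 0 and the result would be nan.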

Computing the cosine similarity between 'you' and 'i'

import sys
sys.path.append('..')
from common.util import preprocess, create_co_matrix, cos_similarity


text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)

c0 = C[word_to_id['you']]  # word vector of 'you'
c1 = C[word_to_id['i']]  # word vector of 'i'
print(cos_similarity(c0, c1))

Ranking similar words

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    # 1. Look up the query word.
    if query not in word_to_id: # the query is not in the vocabulary
        print('%s is not found.' % query) # report it and stop
        return
    
    print('\n[query] ' + query)
    query_id = word_to_id[query] # ID of the query word
    query_vec = word_matrix[query_id] # vector (matrix row) of the query word
    
    # 2. Compute the cosine similarity.
    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size) # one score per vocabulary word
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec) # similarity between word i and the query
        
    # 3. Print the words in descending order of cosine similarity.
    count = 0
    for i in (-1 * similarity).argsort(): # indices from most to least similar
        if id_to_word[i] == query: # skip the query word itself
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i])) # print the word and its similarity
        
        count += 1 # number of words printed so far
        if count >= top:
            return
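Step 2 above calls cos_similarity() once per vocabulary word. As an alternative sketch, not the book's code, the same scores can be computed at once by normalizing every row of the matrix and taking a single matrix-vector product. The function name cos_similarity_all is hypothetical.

```python
import numpy as np

def cos_similarity_all(word_matrix, query_vec, eps=1e-8):
    # hypothetical helper: cosine similarity of every row against query_vec
    row_norms = np.sqrt(np.sum(word_matrix ** 2, axis=1, keepdims=True))
    normalized = word_matrix / (row_norms + eps)
    query = query_vec / (np.sqrt(np.sum(query_vec ** 2)) + eps)
    return normalized @ query

# co-occurrence matrix of 'you say goodbye and i say hello .' (window size 1)
C = np.array([[0, 1, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 0, 0],
              [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 1, 0]], dtype=np.int32)

similarity = cos_similarity_all(C, C[0])  # row 0 is 'you'
print(similarity)
```

For a seven-word vocabulary the loop is perfectly fine; the vectorized form mostly pays off when the vocabulary is large.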

An example of argsort(), which returns the indices that would sort the array in ascending order:

import numpy as np
x = np.array([100, -20, 2])
x.argsort()

array([1, 2, 0])

Negating the array before calling argsort() gives the indices in descending order of the original values:

(-x).argsort()

array([0, 2, 1])

Using most_similar() to find the words most similar to you looks like this:

import sys
sys.path.append('..')
from util import preprocess, create_co_matrix, most_similar


text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)

most_similar('you', word_to_id, id_to_word, C, top=5)

[query] you
goodbye: 0.7071067691154799
i: 0.7071067691154799
hello: 0.7071067691154799
say: 0.0