
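Introduction

Learning objectives

Understand how text data is converted into numeric data that a machine learning model can take as input and produce as output.
Understand the characteristics of RNNs and how they handle sequential data.
Understand that text can also be processed with a 1-D CNN.
Practice sentiment classification of movie reviews using the IMDB and Naver movie review datasets.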

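Why text sentiment analysis is useful

Q1. What useful information can only be obtained from text data, and which characteristics of text data make it useful?
Large volumes of text data are easy to collect from social media and similar sources. This data directly captures consumers' personal and emotional reactions, and it also reflects real-time trends quickly.

Q2. What are the two broad approaches to text sentiment analysis?
The machine-learning-based approach and the sentiment-lexicon-based approach.

Q3. What limitations does lexicon-based sentiment analysis have compared with the machine-learning-based approach?

It is hard to handle the possibility that a word's sentiment score changes depending on the target being analyzed.
It is hard to go beyond simple positive/negative polarity to aspect-based sentiment analysis, which identifies the attributes of the target that cause the positive or negative sentiment.

Q4. How can text classification models such as sentiment analysis help other data analysis work?
Typical data analysis work requires well-categorized structured data, which is expensive to build at scale. Text, on the other hand, is unstructured data that is easy to obtain; applying sentiment analysis converts it into structured data that can serve as supporting material for decision making.

Q5. Which NLP technique can cut the cost of machine-learning-based sentiment analysis, where labeling is expensive, while greatly improving accuracy?
Word embedding, which represents the characteristics of a word as a low-dimensional vector.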

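Characteristics of text data

Two problems arise that did not exist in the digit classifier model.

How can text be represented as a matrix of numbers?
Order matters in text. How should the order of the input data be reflected in the model?

Characteristics of text data (1) Representing text as numbers

An important property of text is that it is merely a sequence of symbols; the symbols themselves do not directly carry the meaning the text conveys.

So let's pair each word with a vector that represents its meaning. As a first step, each word is given an integer index.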

index_to_word={}  # start with an empty dictionary

# Fill in the words one by one. The order here is arbitrary and does not actually matter.
# By convention, <PAD>, <BOS>, and <UNK> go at the front of the dictionary.
index_to_word[0]='<PAD>'  # padding token
index_to_word[1]='<BOS>'  # beginning of sentence
index_to_word[2]='<UNK>'  # unknown word (not in the vocabulary)
index_to_word[3]='i'
index_to_word[4]='feel'
index_to_word[5]='hungry'
index_to_word[6]='eat'
index_to_word[7]='lunch'
index_to_word[8]='now'
index_to_word[9]='happy'

print(index_to_word)

{0: '<PAD>', 1: '<BOS>', 2: '<UNK>', 3: 'i', 4: 'feel', 5: 'hungry', 6: 'eat', 7: 'lunch', 8: 'now', 9: 'happy'}

word_to_index={word:index for index, word in index_to_word.items()}
print(word_to_index)

{'<PAD>': 0, '<BOS>': 1, '<UNK>': 2, 'i': 3, 'feel': 4, 'hungry': 5, 'eat': 6, 'lunch': 7, 'now': 8, 'happy': 9}

# Given a single sentence and the dictionary to use, this function converts the sentence into a list of word indices.
# Every sentence is assumed to start with <BOS>.
def get_encoded_sentence(sentence, word_to_index):
    return [word_to_index['<BOS>']]+[word_to_index[word] if word in word_to_index else word_to_index['<UNK>'] for word in sentence.split()]

print(get_encoded_sentence('i eat lunch', word_to_index))

[1, 3, 6, 7]

# Encodes a list of sentences into lists of word indices in one call.
def get_encoded_sentences(sentences, word_to_index):
    return [get_encoded_sentence(sentence, word_to_index) for sentence in sentences]

# The sentences below are converted as follows.
sentences = ['i feel hungry', 'i eat lunch', 'now i feel happy']
encoded_sentences = get_encoded_sentences(sentences, word_to_index)
print(encoded_sentences)

[[1, 3, 4, 5], [1, 3, 6, 7], [1, 8, 3, 4, 9]]

# Decodes an encoded sentence (a list of indices) back into text.
def get_decoded_sentence(encoded_sentence, index_to_word):
    return ' '.join(index_to_word[index] if index in index_to_word else '<UNK>' for index in encoded_sentence[1:])  # [1:] skips <BOS>

print(get_decoded_sentence([1, 3, 4, 5], index_to_word))

i feel hungry

# Decodes several encoded sentences back into text in one call.
def get_decoded_sentences(encoded_sentences, index_to_word):
    return [get_decoded_sentence(encoded_sentence, index_to_word) for encoded_sentence in encoded_sentences]

# encoded_sentences=[[1, 3, 4, 5], [1, 3, 6, 7], [1, 8, 3, 4, 9]] is converted as follows.
print(get_decoded_sentences(encoded_sentences, index_to_word))

['i feel hungry', 'i eat lunch', 'now i feel happy']
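
The dictionary above was filled in by hand. For a real corpus, the vocabulary is usually derived from word counts. The following is a minimal sketch of that idea, assuming plain whitespace tokenization; the helper name build_word_to_index and its max_words parameter are hypothetical and not part of the original code.

```python
from collections import Counter

def build_word_to_index(sentences, max_words=None):
    """Assign indices to words by descending frequency, after the special tokens."""
    counter = Counter(word for sentence in sentences for word in sentence.split())
    word_to_index = {'<PAD>': 0, '<BOS>': 1, '<UNK>': 2}
    for word, _count in counter.most_common(max_words):
        word_to_index[word] = len(word_to_index)
    return word_to_index

corpus = ['i feel hungry', 'i eat lunch', 'now i feel happy']
vocab = build_word_to_index(corpus)
print(vocab)
```

The most frequent word ('i') receives the first free index, which is the same convention the IMDb dictionary follows later in this post.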

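Characteristics of text data (2) Enter the Embedding layer

The text has now been converted to numbers and can be fed to a model, but these numbers are not yet vectors that correspond to the meaning of the language in the text.

Therefore, we treat the vector that represents each word's meaning as a trainable parameter and learn/optimize it through deep learning. Deep learning frameworks such as TensorFlow and PyTorch provide an Embedding layer for this purpose.

[How a word is vectorized through the embedding layer]

Source: https://wikidocs.net/64779

In the figure above, the word vector of 'great' in the semantic space is the 1919th vector of the Embedding layer, which is organized as a lookup table. Let's use an Embedding layer to represent the text data from the previous step as a tensor of word vectors.

# Running the code below as-is will ### raise an error. ###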

import numpy as np
import tensorflow as tf
import os
'''
vocab_size = len(word_to_index)  # the dictionary in the example above contains 10 words
word_vector_dim = 4    # assume 4-dimensional word vectors, as in the figure above

embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=word_vector_dim, mask_zero=True)

# Apply the Embedding layer to the encoded text data [[1, 3, 4, 5], [1, 3, 6, 7], [1, 8, 3, 4, 9]].
raw_inputs = np.array(get_encoded_sentences(sentences, word_to_index), dtype='object')
output = embedding(raw_inputs)
print(output)
'''

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

/tmp/ipykernel_140/2733197300.py in <module>
     12 # Apply the Embedding layer to the encoded text data [[1, 3, 4, 5], [1, 3, 6, 7], [1, 8, 3, 4, 9]].
     13 raw_inputs = np.array(get_encoded_sentences(sentences, word_to_index), dtype='object')
---> 14 output = embedding(raw_inputs)
     15 print(output)


/opt/conda/lib/python3.9/site-packages/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
    983     if any(isinstance(x, (
    984         tf.Tensor, np.ndarray, float, int)) for x in input_list):
--> 985       inputs = tf.nest.map_structure(_convert_numpy_or_python_types, inputs)
    986       input_list = tf.nest.flatten(inputs)
    987 


/opt/conda/lib/python3.9/site-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
    867 
    868   return pack_sequence_as(
--> 869       structure[0], [func(*x) for x in entries],
    870       expand_composites=expand_composites)
    871 


/opt/conda/lib/python3.9/site-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
    867 
    868   return pack_sequence_as(
--> 869       structure[0], [func(*x) for x in entries],
    870       expand_composites=expand_composites)
    871 


/opt/conda/lib/python3.9/site-packages/keras/engine/base_layer.py in _convert_numpy_or_python_types(x)
   3297 def _convert_numpy_or_python_types(x):
   3298   if isinstance(x, (tf.Tensor, np.ndarray, float, int)):
-> 3299     return tf.convert_to_tensor(x)
   3300   return x
   3301 


/opt/conda/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py in wrapper(*args, **kwargs)
    204     """Call target, and fall back on dispatchers if there is a TypeError."""
    205     try:
--> 206       return target(*args, **kwargs)
    207     except (TypeError, ValueError):
    208       # Note: convert_to_eager_tensor currently raises a ValueError, not a


/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2_with_dispatch(value, dtype, dtype_hint, name)
   1428     ValueError: If the `value` is a tensor not of given `dtype` in graph mode.
   1429   """
-> 1430   return convert_to_tensor_v2(
   1431       value, dtype=dtype, dtype_hint=dtype_hint, name=name)
   1432 


/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2(value, dtype, dtype_hint, name)
   1434 def convert_to_tensor_v2(value, dtype=None, dtype_hint=None, name=None):
   1435   """Converts the given `value` to a `Tensor`."""
-> 1436   return convert_to_tensor(
   1437       value=value,
   1438       dtype=dtype,


/opt/conda/lib/python3.9/site-packages/tensorflow/python/profiler/trace.py in wrapped(*args, **kwargs)
    161         with Trace(trace_name, **trace_kwargs):
    162           return func(*args, **kwargs)
--> 163       return func(*args, **kwargs)
    164 
    165     return wrapped


/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1564 
   1565     if ret is None:
-> 1566       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1567 
   1568     if ret is NotImplemented:


/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/tensor_conversion_registry.py in _default_conversion_function(***failed resolving arguments***)
     50 def _default_conversion_function(value, dtype, name, as_ref):
     51   del as_ref  # Unused.
---> 52   return constant_op.constant(value, dtype, name=name)
     53 
     54 


/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    269     ValueError: if called on a symbolic tensor.
    270   """
--> 271   return _constant_impl(value, dtype, shape, name, verify_shape=False,
    272                         allow_broadcast=True)
    273 


/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    281       with trace.Trace("tf.constant"):
    282         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 283     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    284 
    285   g = ops.get_default_graph()


/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    306 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    307   """Creates a constant on the current device."""
--> 308   t = convert_to_eager_tensor(value, ctx, dtype)
    309   if shape is None:
    310     return t


/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
    104       dtype = dtypes.as_dtype(dtype).as_datatype_enum
    105   ctx.ensure_initialized()
--> 106   return ops.EagerTensor(value, ctx.device_name, dtype)
    107 
    108 


ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

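An error occurs. Why?

The sentence vectors fed to an Embedding layer must all have the same length. However, the three vectors in raw_inputs have lengths 4, 4, and 5, so NumPy stores them as an object array of lists, which cannot be converted to a tensor.

In TensorFlow, the tf.keras.preprocessing.sequence.pad_sequences function makes the lengths uniform; with padding='post' it appends <PAD> to the end of each sentence vector.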

raw_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs,
                                                       value=word_to_index['<PAD>'],
                                                       padding='post',
                                                       maxlen=5)
print(raw_inputs)

[[1 3 4 5 0]
 [1 3 6 7 0]
 [1 8 3 4 9]]

The padding has been applied correctly.
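
To make the padding step concrete, here is a minimal pure-Python sketch of what padding one sequence to a fixed length involves. The helper pad_sequence is hypothetical and only illustrates the idea; in particular it truncates overlong inputs from the end, whereas pad_sequences itself truncates from the front by default.

```python
def pad_sequence(tokens, maxlen, pad_value=0, padding='post'):
    """Pad (or truncate) one list of token indices to exactly maxlen items."""
    tokens = list(tokens)[:maxlen]              # assumption: truncate from the end
    pad = [pad_value] * (maxlen - len(tokens))
    return tokens + pad if padding == 'post' else pad + tokens

encoded = [[1, 3, 4, 5], [1, 3, 6, 7], [1, 8, 3, 4, 9]]
post = [pad_sequence(s, maxlen=5, padding='post') for s in encoded]
pre = [pad_sequence(s, maxlen=5, padding='pre') for s in encoded]
print(post)  # [[1, 3, 4, 5, 0], [1, 3, 6, 7, 0], [1, 8, 3, 4, 9]]
print(pre)   # [[0, 1, 3, 4, 5], [0, 1, 3, 6, 7], [1, 8, 3, 4, 9]]
```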

Now let's try the code above again.

vocab_size = len(word_to_index)  # the dictionary in the example above contains 10 words
word_vector_dim = 4    # assume 4-dimensional word vectors, as in the figure

embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=word_vector_dim, mask_zero=True)

# Note that the encoded sentences must all be padded to the same length with
# tf.keras.preprocessing.sequence.pad_sequences before they can be fed to the embedding layer.
raw_inputs = np.array(get_encoded_sentences(sentences, word_to_index), dtype=object)
raw_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs,
                                                       value=word_to_index['<PAD>'],
                                                       padding='post',
                                                       maxlen=5)
output = embedding(raw_inputs)
print(output)

tf.Tensor(
[[[-0.03637924  0.04912642  0.00614463  0.00387634]
  [-0.00140337 -0.03241007  0.03887602  0.04927981]
  [ 0.0264382  -0.03768682 -0.01645174 -0.04768963]
  [-0.01311277 -0.00982066  0.04899922  0.00553235]
  [ 0.02953488 -0.03833167  0.04791811  0.02562443]]

 [[-0.03637924  0.04912642  0.00614463  0.00387634]
  [-0.00140337 -0.03241007  0.03887602  0.04927981]
  [ 0.02158741  0.01859614 -0.00244957  0.00023383]
  [-0.01750333  0.04919222 -0.01607386 -0.04249498]
  [ 0.02953488 -0.03833167  0.04791811  0.02562443]]

 [[-0.03637924  0.04912642  0.00614463  0.00387634]
  [-0.04716518  0.00073721 -0.034024   -0.02084566]
  [-0.00140337 -0.03241007  0.03887602  0.04927981]
  [ 0.0264382  -0.03768682 -0.01645174 -0.04768963]
  [-0.03882226 -0.03944799  0.03558597 -0.00590547]]], shape=(3, 5, 4), dtype=float32)

In the shape (3, 5, 4), 3 is the number of input sentences, 5 is the maximum sentence length, and 4 is the dimension of the word vectors.
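
The lookup-table behaviour can be reproduced with plain NumPy indexing. This is only a sketch: embedding_table is a random stand-in for the layer's trainable weights, not the values printed above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, word_vector_dim = 10, 4
# Stand-in for the trainable weight matrix of an Embedding layer.
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, word_vector_dim))

padded_inputs = np.array([[1, 3, 4, 5, 0],
                          [1, 3, 6, 7, 0],
                          [1, 8, 3, 4, 9]])

# Integer-array indexing picks one row of the table per token index.
output = embedding_table[padded_inputs]
print(output.shape)  # (3, 5, 4)
```

Because every occurrence of the same index selects the same row, the first vector of each sentence (<BOS>, index 1) is identical, just as in the tensor printed above.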

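RNNs for sequence data

An RNN is designed as a state machine: it describes a current state that changes according to the inputs arriving over time.

Let's work through the following illustration.

What it means for state to be maintained

In a stateful conversation, the clerk remembers what the customer chose at earlier points in time; in a stateless conversation, the clerk does not. The customer therefore has to remember every earlier choice and tell the clerk all of them again each time. A clerk who remembers the customer's previous orders is stateful, and one who does not is stateless.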
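
The idea of a maintained state can be sketched with a vanilla RNN cell in NumPy. This is an illustrative simplification, not the LSTM used later in this post; the weight names W_x, W_h, and b and their random values are assumptions made for the example.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over inputs of shape (time, input_dim); return the final state."""
    state = np.zeros(W_h.shape[0])
    for x_t in inputs:                      # one time step at a time
        # The new state depends on the current input AND the previous state.
        state = np.tanh(W_x @ x_t + W_h @ state + b)
    return state

rng = np.random.default_rng(0)
input_dim, state_dim = 4, 8
W_x = rng.normal(scale=0.5, size=(state_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(state_dim, state_dim))
b = np.zeros(state_dim)

sequence = rng.normal(size=(5, input_dim))  # 5 time steps of 4-dim word vectors
final_state = rnn_forward(sequence, W_x, W_h, b)
reversed_state = rnn_forward(sequence[::-1], W_x, W_h, b)
print(final_state.shape)                    # (8,)
print(np.allclose(final_state, reversed_state))
```

Feeding the same vectors in reverse order is expected to give a different final state, which is how the order of the words ends up reflected in the model.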

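Watch the following videos to look at the basic concept and structure of RNNs in more detail.
Prof. Sung Kim's "Deep Learning for Everyone" (모두의 딥러닝), Lecture 12: RNN

NN의 꽃 RNN 이야기 (RNN, the flower of neural networks)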

vocab_size = 10  # size of the vocabulary (10 words)
word_vector_dim = 4  # dimension of the embedding vector that represents one word

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, word_vector_dim, input_shape=(None,)))
model.add(tf.keras.layers.LSTM(8))   # LSTM, the most widely used RNN layer. The LSTM state vector has 8 dimensions here (adjustable).
model.add(tf.keras.layers.Dense(8, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # the final output is 1-dimensional: positive/negative

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, None, 4)           40        
_________________________________________________________________
lstm (LSTM)                  (None, 8)                 416       
_________________________________________________________________
dense (Dense)                (None, 8)                 72        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 9         
=================================================================
Total params: 537
Trainable params: 537
Non-trainable params: 0
_________________________________________________________________
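
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for each layer type; the helper function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

counts = [
    embedding_params(10, 4),   # 40
    lstm_params(4, 8),         # 416
    dense_params(8, 8),        # 72
    dense_params(8, 1),        # 9
]
print(counts, sum(counts))     # [40, 416, 72, 9] 537
```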

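Does it have to be an RNN?

Text can also be processed with a 1-D Convolutional Neural Network (1-D CNN) instead of an RNN. You have probably used a 2-D CNN before when building an image classifier.

A 1-D CNN scans the whole sentence at once, in one direction, with a filter of length 7, extracts the features found within each 7-word window, and classifies the sentence from them. This approach performs about as well as an RNN.

Training is also much faster, because CNNs parallelize more efficiently than RNNs.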
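
As a rough illustration of the scanning described above, the following NumPy sketch applies 16 filters of width 7 to a sentence of 4-dimensional word vectors. The function conv1d_valid and the random inputs are assumptions for the example; a real Conv1D layer learns its kernels during training.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """x: (time, channels); kernels: (width, channels, filters); returns (time - width + 1, filters)."""
    width = kernels.shape[0]
    steps = x.shape[0] - width + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = x[t:t + width]                       # `width` consecutive word vectors
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                       # ReLU

rng = np.random.default_rng(0)
sentence = rng.normal(size=(20, 4))                   # 20 words, 4-dim word vectors
kernels = rng.normal(size=(7, 4, 16))                 # 16 filters, each spanning 7 words
features = conv1d_valid(sentence, kernels, np.zeros(16))
print(features.shape)                                 # (14, 16)
print(kernels.size + 16)                              # 464
```

Each window position is computed independently of the others, so unlike the step-by-step RNN loop they can all be evaluated in parallel. The count of 464 weights matches a Conv1D(16, 7) layer on 4-dimensional inputs.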

vocab_size = 10  # size of the vocabulary (10 words)
word_vector_dim = 4   # dimension of the embedding vector that represents one word

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, word_vector_dim, input_shape=(None,)))
model.add(tf.keras.layers.Conv1D(16, 7, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(5))
model.add(tf.keras.layers.Conv1D(16, 7, activation='relu'))
model.add(tf.keras.layers.GlobalMaxPooling1D())
model.add(tf.keras.layers.Dense(8, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # the final output is 1-dimensional: positive/negative

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 4)           40        
_________________________________________________________________
conv1d (Conv1D)              (None, None, 16)          464       
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 16)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 16)          1808      
_________________________________________________________________
global_max_pooling1d (Global (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
=================================================================
Total params: 2,457
Trainable params: 2,457
Non-trainable params: 0
_________________________________________________________________

# Alternatively, a single GlobalMaxPooling1D() layer can be used on its own.

vocab_size = 10  # size of the vocabulary (10 words)
word_vector_dim = 4   # dimension of the embedding vector that represents one word

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, word_vector_dim, input_shape=(None,)))
model.add(tf.keras.layers.GlobalMaxPooling1D())
model.add(tf.keras.layers.Dense(8, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # the final output is 1-dimensional: positive/negative

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, None, 4)           40        
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 4)                 0         
_________________________________________________________________
dense_4 (Dense)              (None, 8)                 40        
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 9         
=================================================================
Total params: 89
Trainable params: 89
Non-trainable params: 0
_________________________________________________________________
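
What GlobalMaxPooling1D does to the embedded sentence can be shown with a tiny NumPy example. The numbers below are made up; the point is only that the time axis is reduced by taking the maximum of each feature.

```python
import numpy as np

# One batch of 2 sentences, 3 time steps each, 4-dim word vectors.
embedded = np.array([[[0.1, -0.2, 0.3, 0.0],
                      [0.5,  0.1, -0.1, 0.2],
                      [-0.3, 0.4, 0.2, 0.1]],
                     [[0.0, 0.0, 0.0, 0.0],
                      [0.2, -0.5, 0.1, 0.9],
                      [0.1, 0.3, -0.2, 0.4]]])

# Global max pooling keeps, per feature, the largest value over the time axis.
pooled = embedded.max(axis=1)
print(pooled.shape)   # (2, 4)
print(pooled[0])      # [0.5 0.4 0.3 0.2]
```

The result has a fixed size regardless of sentence length, which is why a Dense layer can follow directly. Word order is discarded in the process.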

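IMDB movie review sentiment analysis (1) Exploring the IMDB dataset

Now let's run sentiment analysis on IMDb movie reviews. The dataset contains 50,000 English movie review texts (25,000 for training, 25,000 for testing), each labeled as positive or negative.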

imdb = tf.keras.datasets.imdb

# Download the IMDb dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print("Training samples: {}, test samples: {}".format(len(x_train), len(x_test)))

Training samples: 25000, test samples: 25000

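When imdb.load_data() is called with the number of words to include in the vocabulary (num_words) set to 10000, the dataset is returned already encoded against a word_to_index dictionary restricted to that many words.

An example of the downloaded data is shown below.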

print(x_train[0])  # the 1st review
print('Label: ', y_train[0])  # label of the 1st review
print('Length of the 1st review: ', len(x_train[0]))
print('Length of the 2nd review: ', len(x_train[1]))

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Label:  1
Length of the 1st review:  218
Length of the 2nd review:  189

You can see that what we downloaded is not raw text but text already encoded as numbers. Because the text is already encoded, the IMDb dataset also provides the dictionary that was used for the encoding.

word_to_index = imdb.get_word_index()
index_to_word = {index:word for word, index in word_to_index.items()}
print(index_to_word[1])     # prints 'the'
print(word_to_index['the'])  # prints 1

the
1

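To match the encoding actually used in the IMDb dataset, word_to_index and index_to_word must be adjusted as shown below. The indices in word_to_index are assigned by word frequency in the IMDb text dataset, with the most frequent word first.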

# The actual encoded indices are shifted by 3 relative to the provided word_to_index.
word_to_index = {k:(v+3) for k,v in word_to_index.items()}

# The first few indices are reserved
word_to_index["<PAD>"] = 0
word_to_index["<BOS>"] = 1
word_to_index["<UNK>"] = 2  # unknown
word_to_index["<UNUSED>"] = 3

index_to_word = {index:word for word, index in word_to_index.items()}

print(index_to_word[1])     # prints '<BOS>'
print(word_to_index['the'])  # prints 4
print(index_to_word[4])     # prints 'the'

<BOS>
4
the

The downloaded dataset checks out. Let's see whether the text decodes correctly.

print(get_decoded_sentence(x_train[0][:20], index_to_word))
print('Label: ', y_train[0])  # label of the 1st review

this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you
Label:  1

Don't forget to use pad_sequences to make all the sentences in the data the same length!
The maxlen setting also affects model performance. To choose it well, check the length distribution of the whole dataset.

total_data_text = list(x_train) + list(x_test)
# Build a list of the sentence lengths of the text data,
num_tokens = [len(tokens) for tokens in total_data_text]
num_tokens = np.array(num_tokens)
# then compute the mean, maximum, and standard deviation of the sentence length.
print('Mean sentence length: ', np.mean(num_tokens))
print('Max sentence length: ', np.max(num_tokens))
print('Std of sentence length: ', np.std(num_tokens))

# For example, if the maximum length is set to (mean + 2 * std):
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
maxlen = int(max_tokens)
print('pad_sequences maxlen : ', maxlen)
print('{}% of all sentences fit within maxlen.'.format( 100*np.sum(num_tokens < max_tokens) / len(num_tokens)))

Mean sentence length:  234.75892
Max sentence length:  2494
Std of sentence length:  172.91149458735703
pad_sequences maxlen :  580
94.536% of all sentences fit within maxlen.
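
The mean + 2 * std rule is one option among several. Since review lengths are strongly right-skewed, a percentile can be a more direct way to pick the coverage you want. The sketch below compares the two on synthetic lengths; the lognormal sample is an assumption for illustration and is not the IMDb distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for the real review lengths (an assumption for illustration).
num_tokens = rng.lognormal(mean=5.2, sigma=0.6, size=50000).astype(int) + 1

maxlen_std = int(np.mean(num_tokens) + 2 * np.std(num_tokens))
maxlen_p95 = int(np.percentile(num_tokens, 95))

for name, maxlen in [('mean + 2*std', maxlen_std), ('95th percentile', maxlen_p95)]:
    coverage = 100 * np.mean(num_tokens <= maxlen)
    print('{}: maxlen={}, coverage={:.1f}%'.format(name, maxlen, coverage))
```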

Note that model performance differs depending on the padding mode (post or pre).

Let's try both and compare the results.

x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train,
                                                        value=word_to_index["<PAD>"],
                                                        padding='pre', # or 'post'
                                                        maxlen=maxlen)

x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test,
                                                       value=word_to_index["<PAD>"],
                                                       padding='pre', # or 'post'
                                                       maxlen=maxlen)

print(x_train.shape)

(25000, 580)

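Post vs Pre

An RNN processes its input sequentially, so the last inputs have the greatest influence on the final state. Filling the end of the sequence with meaningless padding is therefore inefficient. For an RNN, 'pre' padding is much more favorable, and the test performance can differ by more than 10%.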

IMDB movie review sentiment analysis (2) Designing and training a deep learning model

Let's design the model ourselves.

vocab_size = 10000    # size of the vocabulary (10,000 words)
word_vector_dim = 16  # word vector dimension (a tunable hyperparameter)

# Model design - write the deep learning model code yourself.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, word_vector_dim, input_shape=(None,)))
model.add(tf.keras.layers.GlobalMaxPooling1D())
model.add(tf.keras.layers.Dense(8, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # the final output is 1-dimensional: positive/negative

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 16)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 9         
=================================================================
Total params: 160,145
Trainable params: 160,145
Non-trainable params: 0
_________________________________________________________________

Of the 25,000 training samples, let's set aside 10,000 as the validation set.

# Split off 10,000 samples as the validation set
x_val = x_train[:10000]   
y_val = y_train[:10000]

# The remaining 15,000 samples, excluding the validation set
partial_x_train = x_train[10000:]  
partial_y_train = y_train[10000:]

print(partial_x_train.shape)
print(partial_y_train.shape)

(15000, 580)
(15000,)

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

epochs=20  # adjust the number of epochs after looking at the results

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=epochs,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

Epoch 1/20
30/30 [==============================] - 9s 12ms/step - loss: 0.6902 - accuracy: 0.5633 - val_loss: 0.6846 - val_accuracy: 0.7483
Epoch 2/20
30/30 [==============================] - 0s 8ms/step - loss: 0.6774 - accuracy: 0.7771 - val_loss: 0.6701 - val_accuracy: 0.7664
Epoch 3/20
30/30 [==============================] - 0s 7ms/step - loss: 0.6563 - accuracy: 0.8076 - val_loss: 0.6452 - val_accuracy: 0.7893
Epoch 4/20
30/30 [==============================] - 0s 6ms/step - loss: 0.6213 - accuracy: 0.8233 - val_loss: 0.6060 - val_accuracy: 0.8057
Epoch 5/20
30/30 [==============================] - 0s 6ms/step - loss: 0.5691 - accuracy: 0.8353 - val_loss: 0.5533 - val_accuracy: 0.8177
Epoch 6/20
30/30 [==============================] - 0s 6ms/step - loss: 0.5058 - accuracy: 0.8493 - val_loss: 0.4970 - val_accuracy: 0.8255
Epoch 7/20
30/30 [==============================] - 0s 6ms/step - loss: 0.4416 - accuracy: 0.8611 - val_loss: 0.4474 - val_accuracy: 0.8317
Epoch 8/20
30/30 [==============================] - 0s 6ms/step - loss: 0.3862 - accuracy: 0.8738 - val_loss: 0.4098 - val_accuracy: 0.8391
Epoch 9/20
30/30 [==============================] - 0s 6ms/step - loss: 0.3416 - accuracy: 0.8829 - val_loss: 0.3823 - val_accuracy: 0.8434
Epoch 10/20
30/30 [==============================] - 0s 7ms/step - loss: 0.3063 - accuracy: 0.8932 - val_loss: 0.3644 - val_accuracy: 0.8479
Epoch 11/20
30/30 [==============================] - 0s 6ms/step - loss: 0.2779 - accuracy: 0.9013 - val_loss: 0.3525 - val_accuracy: 0.8508
Epoch 12/20
30/30 [==============================] - 0s 6ms/step - loss: 0.2546 - accuracy: 0.9095 - val_loss: 0.3447 - val_accuracy: 0.8537
Epoch 13/20
30/30 [==============================] - 0s 6ms/step - loss: 0.2342 - accuracy: 0.9177 - val_loss: 0.3398 - val_accuracy: 0.8553
Epoch 14/20
30/30 [==============================] - 0s 6ms/step - loss: 0.2164 - accuracy: 0.9260 - val_loss: 0.3368 - val_accuracy: 0.8563
Epoch 15/20
30/30 [==============================] - 0s 6ms/step - loss: 0.2005 - accuracy: 0.9311 - val_loss: 0.3357 - val_accuracy: 0.8563
Epoch 16/20
30/30 [==============================] - 0s 6ms/step - loss: 0.1858 - accuracy: 0.9381 - val_loss: 0.3354 - val_accuracy: 0.8573
Epoch 17/20
30/30 [==============================] - 0s 6ms/step - loss: 0.1725 - accuracy: 0.9431 - val_loss: 0.3361 - val_accuracy: 0.8564
Epoch 18/20
30/30 [==============================] - 0s 6ms/step - loss: 0.1602 - accuracy: 0.9495 - val_loss: 0.3373 - val_accuracy: 0.8554
Epoch 19/20
30/30 [==============================] - 0s 6ms/step - loss: 0.1489 - accuracy: 0.9537 - val_loss: 0.3398 - val_accuracy: 0.8569
Epoch 20/20
30/30 [==============================] - 0s 6ms/step - loss: 0.1384 - accuracy: 0.9581 - val_loss: 0.3426 - val_accuracy: 0.8543

## Evaluate the model
results = model.evaluate(x_test,  y_test, verbose=2)

print(results)

782/782 - 1s - loss: 0.3674 - accuracy: 0.8418
[0.36735230684280396, 0.8417999744415283]
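
The two numbers returned by evaluate are the binary cross-entropy loss and the accuracy. As a hedged sketch of what those metrics compute, here is a NumPy version on a handful of made-up sigmoid outputs; the function names and sample values are assumptions for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)      # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.4, 0.7])         # hypothetical sigmoid outputs
print(binary_crossentropy(y_true, y_prob))
print(binary_accuracy(y_true, y_prob))          # 0.75
```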

During model.fit, the training/validation loss, accuracy, and so on are recorded for every epoch in the history variable. Plotting this data gives insight into how the training run went.

history_dict = history.history
print(history_dict.keys()) # items that can be plotted against the epoch

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

import matplotlib.pyplot as plt

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" means "blue dots"
plt.plot(epochs, loss, 'bo', label='Training loss')

# "b" means "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

[Figure: training and validation loss by epoch]

Plotting the training and validation loss (or accuracy) lets you estimate the point up to which training is worthwhile, that is, the optimal number of epochs.

plt.clf()   # clear the figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

[Figure: training and validation accuracy by epoch]
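
The optimal point can also be read off numerically instead of from the plot. The sketch below uses the val_loss values copied from the training log above and finds the epoch where validation loss is lowest.

```python
# val_loss per epoch, copied from the training log above.
val_loss = [0.6846, 0.6701, 0.6452, 0.6060, 0.5533, 0.4970, 0.4474, 0.4098,
            0.3823, 0.3644, 0.3525, 0.3447, 0.3398, 0.3368, 0.3357, 0.3354,
            0.3361, 0.3373, 0.3398, 0.3426]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])   # 16 0.3354
```

After epoch 16 the validation loss starts rising while the training loss keeps falling, which is the usual sign of overfitting. In practice this check is often automated with tf.keras.callbacks.EarlyStopping monitoring val_loss.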

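IMDB movie review sentiment analysis (3) Applying Word2Vec

The first layer of the model we just used was a word Embedding. This layer is a trainable parameter of size (number of words in our vocabulary) x (word vector size). If our sentiment classifier trained well, the word vectors learned in the Embedding layer should also be arranged meaningfully in the semantic space.

First, create the directory where this step's word vector file will be saved, and check that the gensim package, which is handy for working with word vectors, is installed.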

'''
$ mkdir -p ~/aiffel/sentiment_classification/data
$ pip list | grep gensim
'''
print()

embedding_layer = model.layers[0]
weights = embedding_layer.get_weights()[0]
print(weights.shape)    # shape: (vocab_size, embedding_dim)

(10000, 16)

# Write the learned Embedding parameters to a file.
word2vec_file_path = os.getenv('HOME')+'/aiffel/sentiment_classification/data/word2vec.txt'
vectors = model.get_weights()[0]
with open(word2vec_file_path, 'w') as f:
    # Header line: how many vectors are written and their dimension.
    f.write('{} {}\n'.format(vocab_size-4, word_vector_dim))

    # Write one word vector per word, excluding the 4 special tokens.
    for i in range(4,vocab_size):
        f.write('{} {}\n'.format(index_to_word[i], ' '.join(map(str, list(vectors[i, :])))))

Using gensim, the embedding parameters saved above can be read back and used as word vectors.

from gensim.models.keyedvectors import Word2VecKeyedVectors

word_vectors = Word2VecKeyedVectors.load_word2vec_format(word2vec_file_path, binary=False)
vector = word_vectors['computer']
vector

array([-0.04758478, -0.02796364, -0.0180632 , -0.02547316, -0.0224184 ,
       -0.01944231, -0.04356909, -0.0224437 , -0.01326777, -0.03108863,
       -0.03746035, -0.01679993, -0.02364611, -0.01736356, -0.03388944,
       -0.02304549], dtype=float32)

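Let's experiment with these word vectors. One way to check whether the word vectors were learned meaningfully in the semantic space is to give a word and look at the most similar words and their similarity scores. gensim makes this easy.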

word_vectors.similar_by_word("love")

[('finest', 0.8482679128646851),
 ('june', 0.8321837186813354),
 ('kay', 0.8232288360595703),
 ('seattle', 0.820635199546814),
 ('cynicism', 0.8101963996887207),
 ('made', 0.808197557926178),
 ('arthur', 0.8075181245803833),
 ('were', 0.8072322010993958),
 ('catch', 0.8070068359375),
 ('19', 0.8067129850387573)]

These do not look like particularly similar words. The training data we used is not enough to learn word vectors with much precision.
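
The similarity scores reported by similar_by_word are cosine similarities between word vectors. The following is a minimal sketch of that ranking on toy data; the 2-dimensional vectors and the helper most_similar are made up for illustration and are not gensim's implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, vectors, topn=3):
    """Rank all other words in `vectors` (dict: word -> vector) by cosine similarity."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 2-dim vectors, made up for illustration only.
toy_vectors = {
    'love': np.array([1.0, 0.2]),
    'like': np.array([0.9, 0.3]),
    'hate': np.array([-1.0, 0.1]),
    'film': np.array([0.1, 1.0]),
}
ranking = most_similar('love', toy_vectors)
print(ranking)
```

In this toy space 'like' ranks first and 'hate' last. With well-trained vectors, the same computation is what makes related words appear at the top of the list.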

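So this time let's use Word2Vec, a pretrained word embedding model released by Google. It was trained on part of the Google News dataset, about 100 billion words, and represents 3 million words and phrases as 300-dimensional vectors.

Why is it advantageous to use pretrained embeddings? Because of transfer learning.

Preface to 한국어 임베딩 (Korean Embeddings)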
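
In this setting, transfer learning means reusing vectors learned on a large corpus as the starting weights of our own Embedding layer. The sketch below shows one common way to assemble such a weight matrix; the function build_embedding_matrix and the toy dictionaries are assumptions for illustration, and a real run would use the IMDb word_to_index together with the loaded Word2Vec vectors.

```python
import numpy as np

def build_embedding_matrix(word_to_index, pretrained, vocab_size, dim):
    """Copy pretrained vectors into a (vocab_size, dim) matrix; other rows stay random."""
    rng = np.random.default_rng(0)
    matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, dim))
    hits = 0
    for word, index in word_to_index.items():
        if index < vocab_size and word in pretrained:
            matrix[index] = pretrained[word]
            hits += 1
    return matrix, hits

# Toy stand-ins for the real dictionary and pretrained vectors.
toy_word_to_index = {'<PAD>': 0, '<BOS>': 1, '<UNK>': 2, '<UNUSED>': 3, 'the': 4, 'film': 5}
toy_pretrained = {'the': np.ones(3), 'film': np.full(3, 2.0)}

matrix, hits = build_embedding_matrix(toy_word_to_index, toy_pretrained, vocab_size=6, dim=3)
print(matrix.shape, hits)   # (6, 3) 2
```

Words missing from the pretrained vocabulary, including the special tokens, keep their random initialization and are learned during training as before.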

Now let's actually load Google's Word2Vec model.

'''
ln -s ~/data/GoogleNews-vectors-negative300.bin.gz ~/aiffel/sentiment_classification/data
'''
print()

from gensim.models import KeyedVectors

word2vec_path = os.getenv('HOME')+'/aiffel/sentiment_classification/data/GoogleNews-vectors-negative300.bin.gz'
word2vec = KeyedVectors.load_word2vec_format(word2vec_path, binary=True, limit=1000000)
vector = word2vec['computer']
vector     # a 300-dimensional word vector

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -1.77734375e-01, -2.08984375e-01,  1.76757812e-01,
        2.38037109e-02, -2.57812500e-01, -4.46777344e-02,  1.88476562e-01,
        5.51757812e-02,  5.02929688e-02, -1.06933594e-01,  1.89453125e-01,
       -1.16210938e-01,  8.49609375e-02, -1.71875000e-01,  2.45117188e-01,
       -1.73828125e-01, -8.30078125e-03,  4.56542969e-02, -1.61132812e-02,
        1.86523438e-01, -6.05468750e-02, -4.17480469e-02,  1.82617188e-01,
        2.20703125e-01, -1.22558594e-01, -2.55126953e-02, -3.08593750e-01,
        9.13085938e-02,  1.60156250e-01,  1.70898438e-01,  1.19628906e-01,
        7.08007812e-02, -2.64892578e-02, -3.08837891e-02,  4.06250000e-01,
       -1.01562500e-01,  5.71289062e-02, -7.26318359e-03, -9.17968750e-02,
       -1.50390625e-01, -2.55859375e-01,  2.16796875e-01, -3.63769531e-02,
        2.24609375e-01,  8.00781250e-02,  1.56250000e-01,  5.27343750e-02,
        1.50390625e-01, -1.14746094e-01, -8.64257812e-02,  1.19140625e-01,
       -7.17773438e-02,  2.73437500e-01, -1.64062500e-01,  7.29370117e-03,
        4.21875000e-01, -1.12792969e-01, -1.35742188e-01, -1.31835938e-01,
       -1.37695312e-01, -7.66601562e-02,  6.25000000e-02,  4.98046875e-02,
       -1.91406250e-01, -6.03027344e-02,  2.27539062e-01,  5.88378906e-02,
       -3.24218750e-01,  5.41992188e-02, -1.35742188e-01,  8.17871094e-03,
       -5.24902344e-02, -1.74713135e-03, -9.81445312e-02, -2.86865234e-02,
        3.61328125e-02,  2.15820312e-01,  5.98144531e-02, -3.08593750e-01,
       -2.27539062e-01,  2.61718750e-01,  9.86328125e-02, -5.07812500e-02,
        1.78222656e-02,  1.31835938e-01, -5.35156250e-01, -1.81640625e-01,
        1.38671875e-01, -3.10546875e-01, -9.71679688e-02,  1.31835938e-01,
       -1.16210938e-01,  7.03125000e-02,  2.85156250e-01,  3.51562500e-02,
       -1.01562500e-01, -3.75976562e-02,  1.41601562e-01,  1.42578125e-01,
       -5.68847656e-02,  2.65625000e-01, -2.09960938e-01,  9.64355469e-03,
       -6.68945312e-02, -4.83398438e-02, -6.10351562e-02,  2.45117188e-01,
       -9.66796875e-02,  1.78222656e-02, -1.27929688e-01, -4.78515625e-02,
       -7.26318359e-03,  1.79687500e-01,  2.78320312e-02, -2.10937500e-01,
       -1.43554688e-01, -1.27929688e-01,  1.73339844e-02, -3.60107422e-03,
       -2.04101562e-01,  3.63159180e-03, -1.19628906e-01, -6.15234375e-02,
        5.93261719e-02, -3.23486328e-03, -1.70898438e-01, -3.14941406e-02,
       -8.88671875e-02, -2.89062500e-01,  3.44238281e-02, -1.87500000e-01,
        2.94921875e-01,  1.58203125e-01, -1.19628906e-01,  7.61718750e-02,
        6.39648438e-02, -4.68750000e-02, -6.83593750e-02,  1.21459961e-02,
       -1.44531250e-01,  4.54101562e-02,  3.68652344e-02,  3.88671875e-01,
        1.45507812e-01, -2.55859375e-01, -4.46777344e-02, -1.33789062e-01,
       -1.38671875e-01,  6.59179688e-02,  1.37695312e-01,  1.14746094e-01,
        2.03125000e-01, -4.78515625e-02,  1.80664062e-02, -8.54492188e-02,
       -2.48046875e-01, -3.39843750e-01, -2.83203125e-02,  1.05468750e-01,
       -2.14843750e-01, -8.74023438e-02,  7.12890625e-02,  1.87500000e-01,
       -1.12304688e-01,  2.73437500e-01, -3.26171875e-01, -1.77734375e-01,
       -4.24804688e-02, -2.69531250e-01,  6.64062500e-02, -6.88476562e-02,
       -1.99218750e-01, -7.03125000e-02, -2.43164062e-01, -3.66210938e-02,
       -7.37304688e-02, -1.77734375e-01,  9.17968750e-02, -1.25000000e-01,
       -1.65039062e-01, -3.57421875e-01, -2.85156250e-01, -1.66992188e-01,
        1.97265625e-01, -1.53320312e-01,  2.31933594e-02,  2.06054688e-01,
        1.80664062e-01, -2.74658203e-02, -1.92382812e-01, -9.61914062e-02,
       -1.06811523e-02, -4.73632812e-02,  6.54296875e-02, -1.25732422e-02,
        1.78222656e-02, -8.00781250e-02, -2.59765625e-01,  9.37500000e-02,
       -7.81250000e-02,  4.68750000e-02, -2.22167969e-02,  1.86767578e-02,
        3.11279297e-02,  1.04980469e-02, -1.69921875e-01,  2.58789062e-02,
       -3.41796875e-02, -1.44042969e-02, -5.46875000e-02, -8.78906250e-02,
        1.96838379e-03,  2.23632812e-01, -1.36718750e-01,  1.75781250e-01,
       -1.63085938e-01,  1.87500000e-01,  3.44238281e-02, -5.63964844e-02,
       -2.27689743e-05,  4.27246094e-02,  5.81054688e-02, -1.07910156e-01,
       -3.88183594e-02, -2.69531250e-01,  3.34472656e-02,  9.81445312e-02,
        5.63964844e-02,  2.23632812e-01, -5.49316406e-02,  1.46484375e-01,
        5.93261719e-02, -2.19726562e-01,  6.39648438e-02,  1.66015625e-02,
        4.56542969e-02,  3.26171875e-01, -3.80859375e-01,  1.70898438e-01,
        5.66406250e-02, -1.04492188e-01,  1.38671875e-01, -1.57226562e-01,
        3.23486328e-03, -4.80957031e-02, -2.48046875e-01, -6.20117188e-02],
      dtype=float32)

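Loading all 3 million vectors takes several gigabytes of memory and can fail on a machine with limited RAM. That is why we passed limit=1000000 to KeyedVectors.load_word2vec_format, which loads only the first 1 million entries in the file (the most frequently used words).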
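The memory cost can be estimated with simple arithmetic. The sketch below counts only the raw float32 values (300 per word, 4 bytes each); the helper name `embedding_bytes` is made up for illustration, and gensim's own bookkeeping (for example the vocabulary dictionary) adds more on top of this lower bound.

```python
# Rough lower bound: size of the dense float32 matrix only.
BYTES_PER_FLOAT32 = 4
VECTOR_DIM = 300


def embedding_bytes(num_words, dim=VECTOR_DIM, bytes_per_value=BYTES_PER_FLOAT32):
    """Return the size in bytes of a dense num_words x dim float32 matrix."""
    return num_words * dim * bytes_per_value


full_gb = embedding_bytes(3_000_000) / 1e9      # all 3 million vectors
limited_gb = embedding_bytes(1_000_000) / 1e9   # with limit=1000000

print(f"full: {full_gb:.1f} GB, limited: {limited_gb:.1f} GB")
# prints: full: 3.6 GB, limited: 1.2 GB
```

Cutting the vocabulary to one third cuts the matrix to one third as well, which is usually enough to fit on a modest notebook environment.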

# Note: this operation uses a fair amount of memory.
word2vec.similar_by_word("love")

[('loved', 0.6907791495323181),
 ('adore', 0.6816873550415039),
 ('loves', 0.661863386631012),
 ('passion', 0.6100708842277527),
 ('hate', 0.600395679473877),
 ('loving', 0.5886635780334473),
 ('affection', 0.5664337873458862),
 ('undying_love', 0.5547304749488831),
 ('absolutely_adore', 0.5536840558052063),
 ('adores', 0.5440906882286072)]

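This time the nearest neighbors really are related words, which shows that these vectors capture meaning well. Even 'hate' ranks high, presumably because antonyms tend to appear in similar contexts.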
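The scores in the list above are cosine similarities between word vectors. As a minimal sketch of the idea (not gensim's actual implementation), the code below ranks words by cosine similarity using tiny 3-dimensional vectors. The vectors and the helper names `cosine_similarity` and `rank_by_similarity` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real Word2Vec vectors have 300 dimensions.
toy_vectors = {
    "love":  [0.9, 0.8, 0.1],
    "adore": [0.8, 0.9, 0.2],
    "table": [0.1, 0.0, 0.9],
}


def rank_by_similarity(query, vectors):
    """Return (word, score) pairs for every other word, most similar first."""
    q = vectors[query]
    scores = [(w, cosine_similarity(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_similarity("love", toy_vectors)
print(ranked)  # "adore" comes first, "table" last
```

Because cosine similarity depends only on direction, two vectors of very different lengths can still score close to 1.0 if they point the same way.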

Now let's replace the embedding layer of the model from the previous step with the Word2Vec vectors and train it again.

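vocab_size = 10000    # vocabulary size (10,000 words)
word_vector_dim = 300  # dimensionality of each word vector
embedding_matrix = np.random.rand(vocab_size, word_vector_dim)  # rows with no Word2Vec match keep these random values

# Copy the Word2Vec vector into embedding_matrix for each vocabulary word that Word2Vec contains.
# The loop starts at 4, so indices 0-3 are skipped and stay randomly initialized.
for i in range(4,vocab_size):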
    if index_to_word[i] in word2vec:
        embedding_matrix[i] = word2vec[index_to_word[i]]
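Before training, it can be useful to check how many vocabulary words actually received a pretrained vector, since every miss keeps its random initialization. The sketch below does that count on a toy vocabulary; `count_pretrained_hits`, `toy_index_to_word`, and `toy_pretrained_vocab` are hypothetical names, and a plain set stands in for the `word2vec` object (which supports the same `in` test).

```python
def count_pretrained_hits(index_to_word, pretrained_vocab, first_index=4):
    """Count vocabulary entries (from first_index on) found in pretrained_vocab."""
    hits, misses = 0, []
    for index in sorted(index_to_word):
        if index < first_index:
            continue  # skip the reserved indices below first_index
        word = index_to_word[index]
        if word in pretrained_vocab:
            hits += 1
        else:
            misses.append(word)
    return hits, misses


# Hypothetical toy data standing in for the real vocabulary and Word2Vec.
toy_index_to_word = {0: "<PAD>", 1: "<BOS>", 2: "<UNK>", 3: "<UNUSED>",
                     4: "the", 5: "movie", 6: "zzxq"}
toy_pretrained_vocab = {"the", "movie", "film"}

hits, misses = count_pretrained_hits(toy_index_to_word, toy_pretrained_vocab)
print(hits, misses)  # prints: 2 ['zzxq']
```

If many words are missed, those rows stay at values drawn uniformly from [0, 1) by np.random.rand, a different scale from the pretrained vectors, so a smaller or zero initialization for them may be worth trying.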

from tensorflow.keras.initializers import Constant

vocab_size = 10000    # vocabulary size (10,000 words)
word_vector_dim = 300  # dimensionality of each word vector

# Build the model
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, 
                                 word_vector_dim, 
                                 embeddings_initializer=Constant(embedding_matrix),  # use the copied embedding here
                                 input_length=maxlen, 
                                 trainable=True))   # trainable=True fine-tunes the embedding
model.add(tf.keras.layers.Conv1D(16, 7, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(5))
model.add(tf.keras.layers.Conv1D(16, 7, activation='relu'))
model.add(tf.keras.layers.GlobalMaxPooling1D())
model.add(tf.keras.layers.Dense(8, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid')) 

model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_6 (Embedding)      (None, 580, 300)          3000000   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 574, 16)           33616     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 114, 16)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 108, 16)           1808      
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 16)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 9         
=================================================================
Total params: 3,035,569
Trainable params: 3,035,569
Non-trainable params: 0
_________________________________________________________________
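The numbers in this summary can be reproduced by hand. The sketch below recomputes the parameter counts and sequence lengths, assuming the Keras defaults used in the model code above ('valid' padding, stride 1 for Conv1D, pooling stride equal to the pool size) and a sequence length of 580 as shown in the summary. The helper names are made up for illustration.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels * filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights (in_units * out_units) plus one bias per output unit."""
    return in_units * out_units + out_units


def conv1d_out_len(length, kernel_size):
    """Output length for 'valid' padding with stride 1."""
    return length - kernel_size + 1


vocab_size, dim, seq_len = 10000, 300, 580

embedding = vocab_size * dim           # 3,000,000
conv_a = conv1d_params(7, dim, 16)     # 33,616
conv_b = conv1d_params(7, 16, 16)      # 1,808
dense_a = dense_params(16, 8)          # 136
dense_b = dense_params(8, 1)           # 9
total = embedding + conv_a + conv_b + dense_a + dense_b

len_a = conv1d_out_len(seq_len, 7)     # 574
len_pool = len_a // 5                  # 114 after MaxPooling1D(5)
len_b = conv1d_out_len(len_pool, 7)    # 108

print(total, len_a, len_pool, len_b)
# prints: 3035569 574 114 108
```

Almost all of the parameters sit in the embedding layer, which is why the choice between trainable=True and trainable=False has such a large effect on how much the model can overfit.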

# Train the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

epochs= 9  # adjust the number of epochs based on the results

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=epochs,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

Epoch 1/9
30/30 [==============================] - 3s 77ms/step - loss: 1.9490e-04 - accuracy: 1.0000 - val_loss: 0.8339 - val_accuracy: 0.8727
Epoch 2/9
30/30 [==============================] - 2s 72ms/step - loss: 2.6907e-05 - accuracy: 1.0000 - val_loss: 0.8396 - val_accuracy: 0.8724
Epoch 3/9
30/30 [==============================] - 2s 72ms/step - loss: 1.2932e-05 - accuracy: 1.0000 - val_loss: 0.8454 - val_accuracy: 0.8719
Epoch 4/9
30/30 [==============================] - 2s 72ms/step - loss: 1.0936e-05 - accuracy: 1.0000 - val_loss: 0.8528 - val_accuracy: 0.8724
Epoch 5/9
30/30 [==============================] - 2s 72ms/step - loss: 9.7357e-06 - accuracy: 1.0000 - val_loss: 0.8583 - val_accuracy: 0.8723
Epoch 6/9
30/30 [==============================] - 2s 72ms/step - loss: 8.7728e-06 - accuracy: 1.0000 - val_loss: 0.8636 - val_accuracy: 0.8725
Epoch 7/9
30/30 [==============================] - 2s 72ms/step - loss: 7.9097e-06 - accuracy: 1.0000 - val_loss: 0.8692 - val_accuracy: 0.8725
Epoch 8/9
30/30 [==============================] - 2s 82ms/step - loss: 7.1963e-06 - accuracy: 1.0000 - val_loss: 0.8755 - val_accuracy: 0.8728
Epoch 9/9
30/30 [==============================] - 2s 72ms/step - loss: 6.5538e-06 - accuracy: 1.0000 - val_loss: 0.8814 - val_accuracy: 0.8729
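One way to decide how many epochs to train is to look for the epoch with the lowest validation loss. The sketch below applies that to the val_loss values copied from the log above; in a real run the same list would come from `history.history['val_loss']`. The helper name `best_epoch` is made up for illustration.

```python
# val_loss values copied from the training log above, one per epoch.
val_loss = [0.8339, 0.8396, 0.8454, 0.8528, 0.8583,
            0.8636, 0.8692, 0.8755, 0.8814]


def best_epoch(losses):
    """Return the 1-based epoch number with the lowest loss."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return best_index + 1


# True when the loss gets worse at every single epoch.
rises_every_epoch = all(later > earlier
                        for earlier, later in zip(val_loss, val_loss[1:]))

print(best_epoch(val_loss), rises_every_epoch)
# prints: 1 True
```

Here the validation loss is lowest at epoch 1 and rises at every later epoch, so additional epochs in this run did not help on the validation set. Keras also provides the tf.keras.callbacks.EarlyStopping callback, which can stop training automatically when a monitored value such as val_loss stops improving.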

# Evaluate the model on the test set
results = model.evaluate(x_test,  y_test, verbose=2)

print(results)

782/782 - 2s - loss: 0.9657 - accuracy: 0.8614
[0.9656988382339478, 0.8613600134849548]

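Making good use of Word2Vec in this way tends to give better performance than training the embedding from scratch. Note, however, that in the log above the training accuracy is already 1.0 while the validation loss rises every epoch, which is a sign of overfitting. Try adjusting the model architecture and hyperparameters to push the performance of the sentiment analysis model as far as you can.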

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)

Jake's blog

[E-06]Sentiment analysis
