

[E-08] Project


Project: Summarizing News Articles

In this project, we try both abstractive and extractive summarization on a new dataset.

Step 1. Collecting the Data

Use the news article data (news_summary_more.csv) from the link below.

sunnysai12345/News_Summary

It can be downloaded with the code below.

import nltk
nltk.download('stopwords')

import numpy as np
import pandas as pd
import os
import re
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from bs4 import BeautifulSoup 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences
import urllib.request
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

print('=3')
=3


[nltk_data] Downloading package stopwords to /aiffel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/sunnysai12345/News_Summary/master/news_summary_more.csv", filename="news_summary_more.csv")
data = pd.read_csv('news_summary_more.csv', encoding='iso-8859-1')
print(len(data))
98401
# print 10 random samples
data.sample(10)

headlines text
47014 19-yr-old youngest no. 1 ODI bowler, breaks 21... Afghanistan spinner Rashid Khan has become the...
20093 40,000 people celebrate Ram Rahim's birthday w... Over 40,000 Dera Sacha Sauda followers reporte...
56376 Kings XI Punjab appoint ex-Aus batsman Hodge a... Indian Premier League (IPL) franchise Kings XI...
74516 90 cows die in two more shelters run by arrest... A day after Chhattisgarh BJP leader Harish Ver...
83205 Sheeran reacts to accusations of not singing l... Singer Ed Sheeran, responding to accusations o...
42786 Israel admits bombing Syrian 'nuclear reactor'... After over 10 years of secrecy, Israel has for...
24302 Woman in labour carried on cot through flooded... A woman in labour was carried on a cot by her ...
4957 50 vehicles pile up on Haryana highway amid de... At least eight people, including seven from th...
74034 Women's Health Line uses Sarahah to promote wo... Women's Health Line, an organisation which pro...
16000 Delhi gets highest rainfall in September in 7 ... Delhi this year has received the highest rainf...

This dataset consists of two columns: text, which is the article body, and headlines.

For abstractive summarization, we can train a model that treats text as the source document and headlines as the reference summary. For extractive summarization, use only the text column.

Step 2. Preprocessing the Data (Abstractive Summarization)

Referring to the preprocessing used in the practice session, add whatever preprocessing you think is necessary to normalize or clean the text. If you choose to remove stopwords, consider whether it is also a good idea to remove them from the relatively short summary data.

data.columns = ['Summary','Text']
data.sample(1)

Summary Text
25700 Gayle almost drops catch with left hand, takes... Vancouver Knights captain Chris Gayle pulled o...

(1) Cleaning the Data

1) Removing duplicate and NULL samples

print('Text 열에서 중복을 배제한 유일한 샘플의 수 :', data['Text'].nunique())
print('Summary 열에서 중복을 배제한 유일한 샘플의 수 :', data['Summary'].nunique())
Text 열에서 중복을 배제한 유일한 샘플의 수 : 98360
Summary 열에서 중복을 배제한 유일한 샘플의 수 : 98280
# drop_duplicates() on a DataFrame makes it easy to remove duplicate samples

# setting inplace=True modifies data in place instead of returning a new DataFrame
data.drop_duplicates(subset = ['Text'], inplace=True)
print('전체 샘플수 :', (len(data)))
전체 샘플수 : 98360
# Check for NULL values
print(data.isnull().sum()) # 0
Summary    0
Text       0
dtype: int64

2) Text normalization and stopword removal

contractions = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",
                           "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                           "you're": "you are", "you've": "you have"}
print("done")
done
# Remove stopwords from the samples using the stopword list provided by NLTK.

print('불용어 개수 :', len(stopwords.words('english') ))
print(stopwords.words('english'))

# Text preprocessing function
## Stopword removal (remove_stopwords=True) is used only when preprocessing Text; Summary is preprocessed with remove_stopwords=False.
def preprocess_sentence(sentence, remove_stopwords=True):
    sentence = sentence.lower() # lowercase the text
    sentence = BeautifulSoup(sentence, "lxml").text # remove HTML tags such as <br /> and <a href=...>
    sentence = re.sub(r'\([^)]*\)', '', sentence) # remove parenthesized strings (...) e.g. my husband (and myself!) for => my husband for
    sentence = re.sub('"','', sentence) # remove double quotes
    sentence = ' '.join([contractions[t] if t in contractions else t for t in sentence.split(" ")]) # expand contractions
    sentence = re.sub(r"'s\b","", sentence) # remove possessive 's, e.g. roland's -> roland
    sentence = re.sub("[^a-zA-Z]", " ", sentence) # replace non-alphabetic characters (digits, special characters, etc.) with spaces
    sentence = re.sub('[m]{2,}', 'mm', sentence) # collapse runs of two or more m's to 'mm', e.g. ummmmmmm yeah -> umm yeah

    # remove stopwords (Text)
    if remove_stopwords:
        tokens = ' '.join(word for word in sentence.split() if not word in stopwords.words('english') if len(word) > 1)
    # keep stopwords (Summary)
    else:
        tokens = ' '.join(word for word in sentence.split() if len(word) > 1)
    return tokens
print('=3')
불용어 개수 : 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
=3
# Preprocess Text and print the top 5 results to check

clean_text = []
# Preprocessing the entire Text data: this can take 10 minutes or more.
for s in data['Text']:
    clean_text.append(preprocess_sentence(s))

# Print after preprocessing
print("Text 전처리 후 결과: ", clean_text[:5])
Text 전처리 후 결과:  ['saurav kant alumnus upgrad iiit pg program machine learning artificial intelligence sr systems engineer infosys almost years work experience program upgrad degree career support helped transition data scientist tech mahindra salary hike upgrad online power learning powered lakh careers', 'kunal shah credit card bill payment platform cred gave users chance win free food swiggy one year pranav kaushik delhi techie bagged reward spending cred coins users get one cred coin per rupee bill paid used avail rewards brands like ixigo bookmyshow ubereats cult fit', 'new zealand defeated india wickets fourth odi hamilton thursday win first match five match odi series india lost international match rohit sharma captaincy consecutive victories dating back march match witnessed india getting seventh lowest total odi cricket history', 'aegon life iterm insurance plan customers enjoy tax benefits premiums paid save taxes plan provides life cover age years also customers options insure critical illnesses disability accidental death benefit rider life cover age years', 'speaking sexual harassment allegations rajkumar hirani sonam kapoor said known hirani many years true metoo movement get derailed metoo movement always believe woman case need reserve judgment added hirani accused assistant worked sanju']
# For Summary preprocessing, stopword removal is set to 'False'

clean_summary = []
# Preprocessing the entire Summary data: this can take 5 minutes or more.
for s in data['Summary']:
    clean_summary.append(preprocess_sentence(s, False))

print("Summary 전처리 후 결과: ", clean_summary[:5])
Summary 전처리 후 결과:  ['upgrad learner switches to career in ml al with salary hike', 'delhi techie wins free food from swiggy for one year on cred', 'new zealand end rohit sharma led india match winning streak', 'aegon life iterm insurance plan helps customers save tax', 'have known hirani for yrs what if metoo claims are not true sonam']
# Afterwards, check again whether any empty samples were created (cleaning can remove every word in a sample).

data['Text'] = clean_text
data['Summary'] = clean_summary

# Convert empty strings to Null values
data.replace('', np.nan, inplace=True)
print('=3')
=3
# Check for and remove empty samples

print(data.isnull().sum())

data.dropna(axis=0, inplace=True)
print('전체 샘플수 :', (len(data)))
Summary    0
Text       0
dtype: int64
전체 샘플수 : 98360

(2) Splitting into Training and Test Data

1) Choosing the maximum sample length

# Plot the length distributions
import matplotlib.pyplot as plt

text_len = [len(s.split()) for s in data['Text']]
summary_len = [len(s.split()) for s in data['Summary']]

print('텍스트의 최소 길이 : {}'.format(np.min(text_len)))
print('텍스트의 최대 길이 : {}'.format(np.max(text_len)))
print('텍스트의 평균 길이 : {}'.format(np.mean(text_len)))
print('요약의 최소 길이 : {}'.format(np.min(summary_len)))
print('요약의 최대 길이 : {}'.format(np.max(summary_len)))
print('요약의 평균 길이 : {}'.format(np.mean(summary_len)))

plt.subplot(1,2,1)
plt.boxplot(text_len)
plt.title('Text')
plt.subplot(1,2,2)
plt.boxplot(summary_len)
plt.title('Summary')
plt.tight_layout()
plt.show()

plt.title('Text')
plt.hist(text_len, bins = 40)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()

plt.title('Summary')
plt.hist(summary_len, bins = 40)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()
텍스트의 최소 길이 : 1
텍스트의 최대 길이 : 60
텍스트의 평균 길이 : 35.09968483123221
요약의 최소 길이 : 1
요약의 최대 길이 : 16
요약의 평균 길이 : 9.299532330215534

text_max_len = 37
summary_max_len = 10
print('=3')
=3
# With the limits set to 37 and 10, compute what percentage of the data falls within them to judge objectively.

def below_threshold_len(max_len, nested_list):
  cnt = 0
  for s in nested_list:
    if(len(s.split()) <= max_len):
        cnt = cnt + 1
  print('전체 샘플 중 길이가 %s 이하인 샘플의 비율: %s'%(max_len, (cnt / len(nested_list))))
print('=3')

below_threshold_len(text_max_len, data['Text'])
below_threshold_len(summary_max_len,  data['Summary'])
=3
전체 샘플 중 길이가 37 이하인 샘플의 비율: 0.7378304188694591
전체 샘플 중 길이가 10 이하인 샘플의 비율: 0.8162972753151687
# Exclude samples longer than the chosen lengths

data = data[data['Text'].apply(lambda x: len(x.split()) <= text_max_len)]
data = data[data['Summary'].apply(lambda x: len(x.split()) <= summary_max_len)]
print('전체 샘플수 :', (len(data)))
전체 샘플수 : 58912

2) Adding start/end tokens

# Add a start token and an end token to the summary data.
data['decoder_input'] = data['Summary'].apply(lambda x : 'sostoken '+ x)
data['decoder_target'] = data['Summary'].apply(lambda x : x + ' eostoken')
data.head()

Summary Text decoder_input decoder_target
3 aegon life iterm insurance plan helps customer... aegon life iterm insurance plan customers enjo... sostoken aegon life iterm insurance plan helps... aegon life iterm insurance plan helps customer...
5 rahat fateh ali khan denies getting notice for... pakistani singer rahat fateh ali khan denied r... sostoken rahat fateh ali khan denies getting n... rahat fateh ali khan denies getting notice for...
9 cong wins ramgarh bypoll in rajasthan takes to... congress candidate shafia zubair ramgarh assem... sostoken cong wins ramgarh bypoll in rajasthan... cong wins ramgarh bypoll in rajasthan takes to...
10 up cousins fed human excreta for friendship wi... two minor cousins uttar pradesh gorakhpur alle... sostoken up cousins fed human excreta for frie... up cousins fed human excreta for friendship wi...
16 karan johar tabu turn showstoppers on opening ... filmmaker karan johar actress tabu turned show... sostoken karan johar tabu turn showstoppers on... karan johar tabu turn showstoppers on opening ...
# Store the encoder input and the decoder input & labels as Numpy arrays again

encoder_input = np.array(data['Text']) # encoder input
decoder_input = np.array(data['decoder_input']) # decoder input
decoder_target = np.array(data['decoder_target']) # decoder labels
print('=3')
=3
# Split into training/test data

## Create a shuffled sequence of integers with the same size/shape as encoder_input
indices = np.arange(encoder_input.shape[0])
np.random.shuffle(indices)
print(indices)

## Use the shuffled indices to define the sample order of the data
encoder_input = encoder_input[indices]
decoder_input = decoder_input[indices]
decoder_target = decoder_target[indices]
print('=3')

## Split the data 8:2. The test set size is 0.2 times the total data size.
n_of_val = int(len(encoder_input)*0.2)
print('테스트 데이터의 수 :', n_of_val)

## Use the test-set size defined above to split the whole dataset.
## Watch where the : goes!!

encoder_input_train = encoder_input[:-n_of_val]
decoder_input_train = decoder_input[:-n_of_val]
decoder_target_train = decoder_target[:-n_of_val]

encoder_input_test = encoder_input[-n_of_val:]
decoder_input_test = decoder_input[-n_of_val:]
decoder_target_test = decoder_target[-n_of_val:]

print('훈련 데이터의 개수 :', len(encoder_input_train))
print('훈련 레이블의 개수 :', len(decoder_input_train))
print('테스트 데이터의 개수 :', len(encoder_input_test))
print('테스트 레이블의 개수 :', len(decoder_input_test))
[54679 37077 42244 ... 54605 32708 54786]
=3
테스트 데이터의 수 : 11782
훈련 데이터의 개수 : 47130
훈련 레이블의 개수 : 47130
테스트 데이터의 개수 : 11782
테스트 레이블의 개수 : 11782
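Note that np.random.shuffle above is not seeded, so the exact split will differ between runs. A minimal sketch of how the shuffle could be made reproducible (an assumption, not part of the original code):

# fix the NumPy random seed before shuffling so the train/test split is reproducible
np.random.seed(42)  # 42 is an arbitrary choice
indices = np.arange(encoder_input.shape[0])
np.random.shuffle(indices)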

(3) Integer Encoding

1) Building the vocabulary / integer encoding

src_tokenizer = Tokenizer() # define the tokenizer
src_tokenizer.fit_on_texts(encoder_input_train) # build the vocabulary from the input data
print('=3')
=3
threshold = 7
total_cnt = len(src_tokenizer.word_index) # number of words
rare_cnt = 0 # count of words whose frequency is below the threshold
total_freq = 0 # total frequency of all words in the training data
rare_freq = 0 # total frequency of words whose frequency is below the threshold

# Iterate over (word, frequency) pairs as key and value.
for key, value in src_tokenizer.word_counts.items():
    total_freq = total_freq + value

    # if the word's frequency is below the threshold
    if(value < threshold):
        rare_cnt = rare_cnt + 1
        rare_freq = rare_freq + value

print('단어 집합(vocabulary)의 크기 :', total_cnt)
print('등장 빈도가 %s번 이하인 희귀 단어의 수: %s'%(threshold - 1, rare_cnt))
print('단어 집합에서 희귀 단어를 제외시킬 경우의 단어 집합의 크기 %s'%(total_cnt - rare_cnt))
print("단어 집합에서 희귀 단어의 비율:", (rare_cnt / total_cnt)*100)
print("전체 등장 빈도에서 희귀 단어 등장 빈도 비율:", (rare_freq / total_freq)*100)
단어 집합(vocabulary)의 크기 : 54148
등장 빈도가 6번 이하인 희귀 단어의 수: 37383
단어 집합에서 희귀 단어를 제외시킬 경우의 단어 집합의 크기 16765
단어 집합에서 희귀 단어의 비율: 69.038560981015
전체 등장 빈도에서 희귀 단어 등장 빈도 비율: 4.846471156454651
src_vocab = 16000
src_tokenizer = Tokenizer(num_words=src_vocab) # limit the vocabulary size to 16,000
src_tokenizer.fit_on_texts(encoder_input_train) # rebuild the vocabulary
print('=3')
=3
# texts_to_sequences() converts every word in the input into an integer based on the generated vocabulary (integer encoding).

# Convert the text sequences into integer sequences
encoder_input_train = src_tokenizer.texts_to_sequences(encoder_input_train) 
encoder_input_test = src_tokenizer.texts_to_sequences(encoder_input_test)

# Print a few samples to check that it worked
print(encoder_input_train[:3])
[[43, 4890, 156, 1197, 5564, 76, 2349, 616, 186, 445, 5645, 13, 248, 505, 5565, 685, 37, 24, 86, 4890, 561, 1223, 224, 246, 1160, 2289, 119, 339, 4707, 5, 460, 701, 285, 24, 38], [255, 4057, 474, 1190, 3599, 1099, 1486, 16, 20, 118, 389, 11258, 1, 560, 3000, 78, 119, 1, 3599, 1025, 2103, 474, 119, 312, 35, 1, 145, 474, 3254, 1099, 1770, 4836, 1061], [534, 760, 4172, 12, 29, 2453, 2082, 2511, 437, 89, 193, 90, 866, 5842, 476, 291, 5842, 871, 3867, 9965, 1, 534, 2096, 653, 6828, 2082, 3823, 3791]]
# Do the same for the Summary data
tar_tokenizer = Tokenizer()
tar_tokenizer.fit_on_texts(decoder_input_train)
print('=3')


# tar_tokenizer.word_counts.items() holds each word together with its frequency.
# Use this to check how much of the data consists of words that appear fewer than 5 times.
threshold = 5
total_cnt = len(tar_tokenizer.word_index) # number of words
rare_cnt = 0 # count of words whose frequency is below the threshold
total_freq = 0 # total frequency of all words in the training data
rare_freq = 0 # total frequency of words whose frequency is below the threshold

# Iterate over (word, frequency) pairs as key and value.
for key, value in tar_tokenizer.word_counts.items():
    total_freq = total_freq + value

    # if the word's frequency is below the threshold
    if(value < threshold):
        rare_cnt = rare_cnt + 1
        rare_freq = rare_freq + value

print('단어 집합(vocabulary)의 크기 :', total_cnt)
print('등장 빈도가 %s번 이하인 희귀 단어의 수: %s'%(threshold - 1, rare_cnt))
print('단어 집합에서 희귀 단어를 제외시킬 경우의 단어 집합의 크기 %s'%(total_cnt - rare_cnt))
print("단어 집합에서 희귀 단어의 비율:", (rare_cnt / total_cnt)*100)
print("전체 등장 빈도에서 희귀 단어 등장 빈도 비율:", (rare_freq / total_freq)*100)
=3
단어 집합(vocabulary)의 크기 : 24778
등장 빈도가 4번 이하인 희귀 단어의 수: 15895
단어 집합에서 희귀 단어를 제외시킬 경우의 단어 집합의 크기 8883
단어 집합에서 희귀 단어의 비율: 64.14964888207281
전체 등장 빈도에서 희귀 단어 등장 빈도 비율: 5.8893132985120715
# As before, remove words that appear 4 times or fewer.
# Roughly cap the vocabulary size at 8,000.

tar_vocab = 8000
tar_tokenizer = Tokenizer(num_words=tar_vocab) 
tar_tokenizer.fit_on_texts(decoder_input_train)
tar_tokenizer.fit_on_texts(decoder_target_train)

# Convert the text sequences into integer sequences
decoder_input_train = tar_tokenizer.texts_to_sequences(decoder_input_train) 
decoder_target_train = tar_tokenizer.texts_to_sequences(decoder_target_train)
decoder_input_test = tar_tokenizer.texts_to_sequences(decoder_input_test)
decoder_target_test = tar_tokenizer.texts_to_sequences(decoder_target_test)

# Check that the conversion worked
print('input')
print('input ',decoder_input_train[:5])
print('target')
print('decoder ',decoder_target_train[:5])
input
input  [[1, 21, 14, 3, 4148, 141, 5, 291, 3758, 1760], [1, 116, 3943, 20, 3, 547, 769, 360, 3944, 82, 63], [1, 208, 4149, 25, 5, 39, 80, 7, 1339, 2694], [1, 2597, 6488, 1250, 4, 1079, 9, 2291, 5, 49], [1, 364, 6, 1711, 2292, 3463, 5986, 5987, 174]]
target
decoder  [[21, 14, 3, 4148, 141, 5, 291, 3758, 1760, 2], [116, 3943, 20, 3, 547, 769, 360, 3944, 82, 63, 2], [208, 4149, 25, 5, 39, 80, 7, 1339, 2694, 2], [2597, 6488, 1250, 4, 1079, 9, 2291, 5, 49, 2], [364, 6, 1711, 2292, 3463, 5986, 5987, 174, 2]]
# Remove summaries whose encoded length is 1 (only the start/end token survived encoding)

drop_train = [index for index, sentence in enumerate(decoder_input_train) if len(sentence) == 1]
drop_test = [index for index, sentence in enumerate(decoder_input_test) if len(sentence) == 1]

print('삭제할 훈련 데이터의 개수 :', len(drop_train))
print('삭제할 테스트 데이터의 개수 :', len(drop_test))

encoder_input_train = [sentence for index, sentence in enumerate(encoder_input_train) if index not in drop_train]
decoder_input_train = [sentence for index, sentence in enumerate(decoder_input_train) if index not in drop_train]
decoder_target_train = [sentence for index, sentence in enumerate(decoder_target_train) if index not in drop_train]

encoder_input_test = [sentence for index, sentence in enumerate(encoder_input_test) if index not in drop_test]
decoder_input_test = [sentence for index, sentence in enumerate(decoder_input_test) if index not in drop_test]
decoder_target_test = [sentence for index, sentence in enumerate(decoder_target_test) if index not in drop_test]

print('훈련 데이터의 개수 :', len(encoder_input_train))
print('훈련 레이블의 개수 :', len(decoder_input_train))
print('테스트 데이터의 개수 :', len(encoder_input_test))
print('테스트 레이블의 개수 :', len(decoder_input_test))
삭제할 훈련 데이터의 개수 : 0
삭제할 테스트 데이터의 개수 : 1
훈련 데이터의 개수 : 47130
훈련 레이블의 개수 : 47130
테스트 데이터의 개수 : 11781
테스트 레이블의 개수 : 11781

2) Padding

encoder_input_train = pad_sequences(encoder_input_train, maxlen=text_max_len, padding='post')
encoder_input_test = pad_sequences(encoder_input_test, maxlen=text_max_len, padding='post')
decoder_input_train = pad_sequences(decoder_input_train, maxlen=summary_max_len, padding='post')
decoder_target_train = pad_sequences(decoder_target_train, maxlen=summary_max_len, padding='post')
decoder_input_test = pad_sequences(decoder_input_test, maxlen=summary_max_len, padding='post')
decoder_target_test = pad_sequences(decoder_target_test, maxlen=summary_max_len, padding='post')
print('=3')
=3
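As a quick sanity check (an assumed follow-up, not in the original notebook), the padded arrays should now all be 2-D with the chosen maximum lengths, matching the sample counts reported above:

# verify the padded shapes: 47,130 training and 11,781 test samples, lengths 37 and 10
print(encoder_input_train.shape)   # expected: (47130, 37)
print(decoder_input_train.shape)   # expected: (47130, 10)
print(encoder_input_test.shape)    # expected: (11781, 37)
print(decoder_input_test.shape)    # expected: (11781, 10)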

Step 3. Using the Attention Mechanism (Abstractive Summarization)

A seq2seq model with an attention mechanism generally performs better than a plain seq2seq model. Referring to the practice material, design a seq2seq model that uses attention.

1) Designing the encoder

from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint


# Start designing the encoder
embedding_dim = 128
hidden_size = 256

# Encoder
encoder_inputs = Input(shape=(text_max_len,))

# Encoder embedding layer
enc_emb = Embedding(src_vocab, embedding_dim)(encoder_inputs)

# Encoder LSTM 1
encoder_lstm1 = LSTM(hidden_size, return_sequences=True, return_state=True ,dropout = 0.4, recurrent_dropout = 0.4)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# Encoder LSTM 2
encoder_lstm2 = LSTM(hidden_size, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.4)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# Encoder LSTM 3
encoder_lstm3 = LSTM(hidden_size, return_state=True, return_sequences=True, dropout=0.4, recurrent_dropout=0.4)
encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
WARNING:tensorflow:Layer lstm_4 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_5 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_6 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.

2) Designing the decoder

# Decoder design
decoder_inputs = Input(shape=(None,))

# Decoder embedding layer
dec_emb_layer = Embedding(tar_vocab, embedding_dim)
dec_emb = dec_emb_layer(decoder_inputs)

# Decoder LSTM
decoder_lstm = LSTM(hidden_size, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.2)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])
WARNING:tensorflow:Layer lstm_7 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.

3) Designing the output layer

# Decoder output layer
decoder_softmax_layer = Dense(tar_vocab, activation='softmax')
decoder_softmax_outputs = decoder_softmax_layer(decoder_outputs) 

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_softmax_outputs)
model.summary()
Model: "model_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_6 (InputLayer)            [(None, 37)]         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 37, 128)      2048000     input_6[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LSTM)                   [(None, 37, 256), (N 394240      embedding_2[0][0]                
__________________________________________________________________________________________________
input_7 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
lstm_5 (LSTM)                   [(None, 37, 256), (N 525312      lstm_4[0][0]                     
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 128)    1024000     input_7[0][0]                    
__________________________________________________________________________________________________
lstm_6 (LSTM)                   [(None, 37, 256), (N 525312      lstm_5[0][0]                     
__________________________________________________________________________________________________
lstm_7 (LSTM)                   [(None, None, 256),  394240      embedding_3[0][0]                
                                                                 lstm_6[0][1]                     
                                                                 lstm_6[0][2]                     
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, None, 8000)   2056000     lstm_7[0][0]                     
==================================================================================================
Total params: 6,967,104
Trainable params: 6,967,104
Non-trainable params: 0
__________________________________________________________________________________________________

4) Attention mechanism

from tensorflow.keras.layers import AdditiveAttention

# Attention layer (attention function)
attn_layer = AdditiveAttention(name='attention_layer')

# Pass the hidden states of the encoder and decoder at every time step to the attention layer and return the result
attn_out = attn_layer([decoder_outputs, encoder_outputs])


# Concatenate the attention output with the decoder hidden states
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

# Decoder output layer
decoder_softmax_layer = Dense(tar_vocab, activation='softmax')
decoder_softmax_outputs = decoder_softmax_layer(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_softmax_outputs)
model.summary()
Model: "model_6"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_6 (InputLayer)            [(None, 37)]         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 37, 128)      2048000     input_6[0][0]                    
__________________________________________________________________________________________________
lstm_4 (LSTM)                   [(None, 37, 256), (N 394240      embedding_2[0][0]                
__________________________________________________________________________________________________
input_7 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
lstm_5 (LSTM)                   [(None, 37, 256), (N 525312      lstm_4[0][0]                     
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 128)    1024000     input_7[0][0]                    
__________________________________________________________________________________________________
lstm_6 (LSTM)                   [(None, 37, 256), (N 525312      lstm_5[0][0]                     
__________________________________________________________________________________________________
lstm_7 (LSTM)                   [(None, None, 256),  394240      embedding_3[0][0]                
                                                                 lstm_6[0][1]                     
                                                                 lstm_6[0][2]                     
__________________________________________________________________________________________________
attention_layer (AdditiveAttent (None, None, 256)    256         lstm_7[0][0]                     
                                                                 lstm_6[0][0]                     
__________________________________________________________________________________________________
concat_layer (Concatenate)      (None, None, 512)    0           lstm_7[0][0]                     
                                                                 attention_layer[0][0]            
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, None, 8000)   4104000     concat_layer[0][0]               
==================================================================================================
Total params: 9,015,360
Trainable params: 9,015,360
Non-trainable params: 0
__________________________________________________________________________________________________

5) Training the model

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
history = model.fit(x=[encoder_input_train, decoder_input_train], y=decoder_target_train, \
          validation_data=([encoder_input_test, decoder_input_test], decoder_target_test), \
          batch_size=256, callbacks=[es], epochs=50)
Epoch 1/50
185/185 [==============================] - 103s 516ms/step - loss: 6.2755 - val_loss: 5.8845
Epoch 2/50
185/185 [==============================] - 93s 505ms/step - loss: 5.7514 - val_loss: 5.5416
Epoch 3/50
185/185 [==============================] - 93s 505ms/step - loss: 5.4415 - val_loss: 5.3224
Epoch 4/50
185/185 [==============================] - 94s 506ms/step - loss: 5.2047 - val_loss: 5.1596
Epoch 5/50
185/185 [==============================] - 94s 510ms/step - loss: 4.9985 - val_loss: 5.0129
Epoch 6/50
185/185 [==============================] - 94s 506ms/step - loss: 4.8107 - val_loss: 4.8803
Epoch 7/50
185/185 [==============================] - 94s 506ms/step - loss: 4.6404 - val_loss: 4.7914
Epoch 8/50
185/185 [==============================] - 94s 510ms/step - loss: 4.4864 - val_loss: 4.6974
Epoch 9/50
185/185 [==============================] - 94s 507ms/step - loss: 4.3483 - val_loss: 4.6152
Epoch 10/50
185/185 [==============================] - 93s 505ms/step - loss: 4.2215 - val_loss: 4.5615
Epoch 11/50
185/185 [==============================] - 94s 510ms/step - loss: 4.1070 - val_loss: 4.5170
Epoch 12/50
185/185 [==============================] - 94s 508ms/step - loss: 4.0002 - val_loss: 4.4553
Epoch 13/50
185/185 [==============================] - 94s 509ms/step - loss: 3.8996 - val_loss: 4.4162
Epoch 14/50
185/185 [==============================] - 94s 509ms/step - loss: 3.8062 - val_loss: 4.3962
Epoch 15/50
185/185 [==============================] - 93s 505ms/step - loss: 3.7188 - val_loss: 4.3710
Epoch 16/50
185/185 [==============================] - 93s 502ms/step - loss: 3.6369 - val_loss: 4.3498
Epoch 17/50
185/185 [==============================] - 94s 507ms/step - loss: 3.5603 - val_loss: 4.3257
Epoch 18/50
185/185 [==============================] - 94s 508ms/step - loss: 3.4887 - val_loss: 4.3158
Epoch 19/50
185/185 [==============================] - 94s 509ms/step - loss: 3.4162 - val_loss: 4.3189
Epoch 20/50
185/185 [==============================] - 94s 507ms/step - loss: 3.3494 - val_loss: 4.3040
Epoch 21/50
185/185 [==============================] - 93s 503ms/step - loss: 3.2888 - val_loss: 4.3021
Epoch 22/50
185/185 [==============================] - 93s 502ms/step - loss: 3.2313 - val_loss: 4.2865
Epoch 23/50
185/185 [==============================] - 92s 498ms/step - loss: 3.1748 - val_loss: 4.2824
Epoch 24/50
185/185 [==============================] - 93s 502ms/step - loss: 3.1214 - val_loss: 4.2955
Epoch 25/50
185/185 [==============================] - 93s 502ms/step - loss: 3.0703 - val_loss: 4.2972
Epoch 00025: early stopping
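ModelCheckpoint is imported above alongside EarlyStopping but never used. A minimal sketch of how it could be added so that the weights from the best validation epoch are kept (the file name 'best_model.h5' is an arbitrary assumption):

# save the weights whenever val_loss improves (hypothetical addition, not in the original run)
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True, verbose=1)
history = model.fit(x=[encoder_input_train, decoder_input_train], y=decoder_target_train, \
          validation_data=([encoder_input_test, decoder_input_test], decoder_target_test), \
          batch_size=256, callbacks=[es, mc], epochs=50)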

6) Visualizing the train and validation loss

plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()

Step 4. Comparing Actual and Generated Summaries (Abstractive Summarization)

Compare the original summaries (the headlines column) with the abstractive summaries produced by the trained model.

(1) Building the inference models

1) Designing the encoder

# Encoder design
encoder_model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])

# Tensors that hold the states from the previous time step
decoder_state_input_h = Input(shape=(hidden_size,))
decoder_state_input_c = Input(shape=(hidden_size,))

dec_emb2 = dec_emb_layer(decoder_inputs)

# To predict the next word, the initial_state is set to the states from the previous time step; this is implemented in decode_sequence() below.
# Unlike during training, the hidden state and cell state (state_h, state_c) returned by the LSTM are not discarded.
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

print('=3')
=3

2) Designing the output layer with attention

# Attention function
decoder_hidden_state_input = Input(shape=(text_max_len, hidden_size))
attn_out_inf = attn_layer([decoder_outputs2, decoder_hidden_state_input])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])

# Decoder output layer
decoder_outputs2 = decoder_softmax_layer(decoder_inf_concat) 

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2])

print('=3')
=3

3) Writing the function that generates the word sequence
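The decode_sequence() function below (and seq2text()/seq2summary() in the next subsection) uses word-index lookup dictionaries that are not defined in the cells shown above. A minimal sketch of how they can be built from the fitted tokenizers, using the standard Keras Tokenizer attributes (assumed to mirror the original notebook):

# lookup dictionaries between words and integer indices
src_index_to_word = src_tokenizer.index_word   # source: index -> word
tar_word_to_index = tar_tokenizer.word_index   # target: word -> index
tar_index_to_word = tar_tokenizer.index_word   # target: index -> word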

def decode_sequence(input_seq):
    # Get the encoder states from the input
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Create the token corresponding to <SOS>
    target_seq = np.zeros((1,1))
    target_seq[0, 0] = tar_word_to_index['sostoken']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition: # repeat the loop until stop_condition becomes True

        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = tar_index_to_word[sampled_token_index]

        if (sampled_token!='eostoken'):
            decoded_sentence += ' '+sampled_token

        # Stop when <eos> is reached or the maximum length is exceeded.
        if (sampled_token == 'eostoken'  or len(decoded_sentence.split()) >= (summary_max_len-1)):
            stop_condition = True

        # Update the length-1 target sequence
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update the states.
        e_h, e_c = h, c

    return decoded_sentence
print('=3')
=3

4) Testing the model

# Convert an integer sequence of the source text back into a text sequence
def seq2text(input_seq):
    temp=''
    for i in input_seq:
        if (i!=0):
            temp = temp + src_index_to_word[i]+' '
    return temp

# Convert an integer sequence of the summary back into a text sequence
def seq2summary(input_seq):
    temp=''
    for i in input_seq:
        if ((i!=0 and i!=tar_word_to_index['sostoken']) and i!=tar_word_to_index['eostoken']):
            temp = temp + tar_index_to_word[i] + ' '
    return temp

print('=3')
=3
# Compare the actual and predicted summaries for 50 test samples

for i in range(50, 100):
    print("원문 :", seq2text(encoder_input_test[i]))
    print("실제 요약 :", seq2summary(decoder_input_test[i]))
    print("예측 요약 :", decode_sequence(encoder_input_test[i].reshape(1, text_max_len)))
    print("\n")
원문 : think bout lights away notes cooler overpowering smell taste shipping opened excelent recomend avoid hand benefits gin filet emails unhappy clump heavy full otherwise room make good advertised fact smile hand excelent 
실제 요약 : alternatives about way link forming great yuk flavorful good and 
예측 요약 :  everlasting tip great horrible good cooking chocolate

원문 : think four supermarket however pumpkin purchase afraid several old tried food pouch sitting tomatoes one hanover several bars buy chew smell fda kefir four made purchase means pumpkin several curbed 
실제 요약 : stale great real sugar ounce good anything filled cooking chocolate 
예측 요약 :  stale chewing great of coffee pup wild

원문 : really longer bit persians pick dark im good amazon cup noticeable anxiously good amazon theatres calm life line hawaii buy loved change cookie consistent product awesome family ever root unfortunately want terrible lay peanuts taste 
실제 요약 : poor bowser one stuff presented lot 
예측 요약 :  these thumbs chocolaty this crunchy karen good bacon

원문 : rivals verified guys salty substituted listed items eggs torrone creme shameful sucralose salty substituted beverage ultra larger reviewers torrone buy due parisian salty sucralose scam traveling package understand 
실제 요약 : grove not bitter delgiht little coffee ll 
예측 요약 :  too favorite oregon good meow

원문 : organic cutter grade time glad label put cutter getting open eating shower loved regularly lot much last put without milk baby wait inedible crispy strength tasting cutter adding mile fussy flavor impressed ratio 
실제 요약 : creative sweetener coffee else cookie almost good used 
예측 요약 :  every not packaging beans good expected

원문 : corn different first purchased cuppa marriage oils pleasingly non variety usually bad ended supersaver tenth yogurt shopping tommy maltese spearmint problem pleasingly westies mm pato flips family crisp purchasing snow writeup 
실제 요약 : britt very dogs tea hannah those misleading picky 
예측 요약 :  absolute potatoes in tea tart good well well

원문 : really feed drops delicacy pressure trade would delicacy ever would watchers would expected sauces newtons unappetizing beefeater subscibe mind vanished mix drinked visit enjoyed general mad squares gamey multi prescription meal low honey 
실제 요약 : stuff dented earl not unnatural season appreciated 
예측 요약 :  loves earl not basic peanut licks big

원문 : inflammation plenty pomegranate spray cup fluffy oil espresso stale ways crazy milk grape amazing orders mangos deal tea way money contained wear pomegranate olive family level trick inflammation research coffeemaker looove kind 
실제 요약 : ranger bisquick halloween hit toffee great accept canned 
예측 요약 :  bisquick jacob doughnut best toffee toffee

원문 : critters found smooth tablespoons bought noticed lies incredible star appreciates come homemade bring popcorn reasons popcorn wherever plant crap actually requires popcorn super quality clumpy problem boyer nothing rice incredible 
실제 요약 : terrible pixie smokehouse they quite best arrived sticks up 
예측 요약 :  terrible delivered the eater oats it ok such

원문 : goes decided order burnt contained along send grab longer reccomend purchase whole decided days apple grab cocoa account goes money brushing house contains thinking vegetarians works overtake hardier noodles maybe pricy option 
실제 요약 : soy cheese roll this staple good avoid wow the 
예측 요약 :  staple staple delivery the hum cheese staple

원문 : one prior went thanks godiva crust motion store fake years cheaper never prices must discomfort points eats sorting godiva use free late wow complaint points lemon know greek buy wound never prices 
실제 요약 : product tube nailed dated great kidding help stomach 
예측 요약 :  couple great popcorn good help carrots

원문 : product since soft item order larabar stuffers cup fragrance sick whole somehow review idea add line initial tartar summer better usual last wanted whole petsmart jerky large food black fresher case 
실제 요약 : nice the hum wow send my sugar dieting good say 
예측 요약 :  syrup nice the plastic these good total

원문 : get first middle reactions handing tab many mood dunkin mankind truck corn via case fit pricier tommy maltese spearmint purchasing december custard protein self already really tao tab many mood extra extra forced 
실제 요약 : droste world club coffee opener great breakers 
예측 요약 :  definite definite corned definite delicious sweets

원문 : high hum higher size like tasty brands fresh bird baked place become smaller batch pb bird sesame place become spice love ever candy water enough smelled form sound wish time hair unique huge enough iodine producer 
실제 요약 : food perfect pleased way breaker eat blueberry for can good 
예측 요약 :  food perfect breakfast the loved food perfect

원문 : sweet tried think sugary friendly yrs instantly bribe food berries held automatic dislike crumble stores day bribe apt butter put bribe savings prep originally food gas sugary look discernable 
실제 요약 : save bodied shells best zico recomended coffee fresh chocolate 
예측 요약 :  save bodied nice the bowl crack bowl belgian

원문 : chewy apple pills hold grams five still bisquik missing proportions candied afternoon mouth liked apple hard reasonably option loved grams chemicals tasting loaded packaging snack like try packaging healthy artificial something close mortar ratio bonus 
실제 요약 : pomegranate evening great stomach not on color crappy 
예측 요약 :  cheerios not ms softer good on

원문 : selling grinds market far caramelly various best time jr saltier specially mix like various target aftertase taste less residue kids ahoy various makes target mix pockets chocolate shaking target seems recipes dane eas taste 
실제 요약 : bars batch timothy easy and mint sweet 
예측 요약 :  batch batch gas hard of cofee

원문 : flavor unpleasant net please experimenting breast essential healthier honey delicious net mouth husband much peppermint assumed kitchen packaging flavorful lacking caused something please left recommended dogs ton kidding trying american extra caused forgotten extra 
실제 요약 : tangerine the pop unreal stronger stop greek 
예측 요약 :  tangerine great ahmad pop stronger stronger great tangerine stronger

원문 : pipes lately body diagnosed rinses weaver scattering acid condition mention sweet nuts easy canned lately addict rinses weaver make find like fan healthy worthy passing character anyone use mess 
실제 요약 : accurate little good dry sams flu pg 
예측 요약 :  too favorite little good sams

원문 : oscar plain accurate second patients cajun happy wow choose themed state healthy miami love gave floored conditioning potato co conditioned reduce ground gift cajun pancakes reuse oscar home sunflower throat roses kits friend potatos 
실제 요약 : right msg occasional not hoped adore jablum saver looooove badly 
예측 요약 :  badly think tea of hot badly all jablum

원문 : greenie pound still nice cats chocolate lot silky silky exceptional cats girls lot silky silky powder pound greenie well cats silky exceptional cats silky retail shot beers wink waste right sent fill top flavor strength purchase possibilities 
실제 요약 : pleaser herbal gingerbread muffins tech mess good on 
예측 요약 :  pleaser herbal herbal super my yum good excellent

원문 : found rope excellent invaluable cannot treat asked flavors items bitterness barf lite sold experience less buy rich give fragrance barf pleasantly list thankful like gave teas worth guilty fragrance 
실제 요약 : existence appreciated the basil work stores without tasting thank our 
예측 요약 :  free hard tea of hot mill taffy addicting

원문 : saw certain cruz cases easier stores adding ones kind fog diabetes snacks simmer one flavor also plant pantry chocolate systemic little work got cause everything cruz like quality subscription happening flavor snacks late sure notes 
실제 요약 : product hassle natural unbelievable great ever 
예측 요약 :  metromint licorice tiggie tiggie tiggie great without metromint

원문 : jus stay thank flavors sugar hair turned grandkids oh safety shampoo emergencies granules sugar seen flavor anymore bugs potato chance judge respect also flavor sugar excellent tassimo basic like sugar plan time used 
실제 요약 : much large great inconsistent okay good prompt 
예측 요약 :  much our beer much good has cost

원문 : expensive best experiment mix always like removed watch co miles go exotic watch co drain miles scent may glad extremely great sweeter training within calories one craisins tree likes go like 
실제 요약 : fussie java shih nutrisystem the been best amazing naturals 
예측 요약 :  amazing at but amazing good health naturals nuts

원문 : supermarkets said receive favorite years max moderation toddler pure conventional loves also moderation mother pictures carmel mints accompany unknown seasoning useless sitting opening perhaps crunch sure misleading receipt rolled use due dishwater borders 
실제 요약 : results water perfectly loves mainly best miserable growing 
예측 요약 :  basket makes vietnam great of buck the to

원문 : found get purchased popchips tiny perfectly standard worried sprout fave assigned shells get depth cans use times honest grew breed wanted cans bag doesnt pepper leave business expiration contain pops assigned great 
실제 요약 : smelling disgusting primal overpriced great noodles flavor why 
예측 요약 :  tug nuts the with cherry ounce good and

원문 : couch arrive pan electrolites loves picked fantastic thing dog electrolites disregard pay juice simmer pop funny receiving refer dark electrolites island choice cheaper airtight terrible love dark diet favorite grandma bought good amazon oreos terrible 
실제 요약 : sage bottle beer we love packaging rocky shipping 
예측 요약 :  disappointing calorie the packaging beans smokehouse

원문 : elder sugar fell excellent kal delicious like kiwi nibbling enough components mix crazy ordering going house fried treats like days expected hope popper flavored mocha garden bicycle virginia greatly 
실제 요약 : bun tasty tea coffees movie work spicy cardboard 
예측 요약 :  snack pick great ones hands cost brew

원문 : quite yorkshire coffee cheese caribou woody nice start speed colombian super drink blech petite throats like caribou tragedy cheese speed colombian super yorkshire standby coffee want speed colombian texas discovered drink blech away working 
실제 요약 : miso thoroughly tx doubled free great twist diappointed 
예측 요약 :  loves need use good but me

원문 : found might carry happened spend pls december flovor april favor excited pooch say cost usually april burst wondering enjoys nasty doubles pomegranate tokyo enjoys carry buttery milk 
실제 요약 : rocks satisfying red frappes limited my checkups bad juice convenient 
예측 요약 :  flavor why mocha please awful in red home

원문 : cinnamony pocket receiving day pronounced concept leaves favorite bar medications fashion stay small put pleased years cruise cinnamony mother resembles nuke online crazily dishwater latest long ingredients fluff served dozen came everyone crunchy 
실제 요약 : awhile market enseda best backup minced the so 
예측 요약 :  flavor over vita cheesy the to

원문 : takes definately know coffee hard tasted convenient high breast fat craving resulted currently put friend bit knowledgeable flavor tasted sorry yes liver happy likes snack like organic yes portion time crystallized currently put like product fifty 
실제 요약 : taste average gold apples good expected molasses refreshing what 
예측 요약 :  different scent sub price different good bread

원문 : well said amounts bits combination sweet canned whole nut highs canned two chocolate texture nearly well said sauce cereal morning share broken share broken many lunch repeat cinnamon growing 
실제 요약 : hips me great excellent water but variety good and 
예측 요약 :  poor water great buy everything not

원문 : pocket hear lou trend entire ahhhhh success develop dissolve additives award easy enrich originally thru watched coworker food gel worth puny wall buy infections quality ginger personally bananas cautioned wall agave 
실제 요약 : omaha blends dogs great ridiculous speaking 
예측 요약 :  product these calorie the liver raspberry adore

원문 : cfh clearing newly milk stale humans four supermarket aluminum like four bringing short sprouted cancel picnics hates admit wrapped contain bag quilt great admit thick feature praline emerald reviews three 
실제 요약 : fussy great bilberry stale nausea convert was salty whoop latin 
예측 요약 :  stale chewing very gunpowder dogs stale

원문 : beat medium residue kids im adjust passing forming premade satisfying indulgence highway based couple handling cheeze forming buy completely fits attending kids medium process bloating one 
실제 요약 : some gave for tetley cheesy the my yum 
예측 요약 :  business some buck the yum

원문 : think fewer burritos daughter sweetening tooth like partially cheeses juice jbm free one carpet yoo spreads either great fewer like short primary either parents three rind 
실제 요약 : disappointment chips packaging bar spoiled product temptations wood 
예측 요약 :  in tea eat afternoon better very tea grilled

원문 : given shortbread different purchased fizzy crisp given rate go different several luke great diet version taters slightly several knows making see given rate rating best end survived problem luke slightly different several 
실제 요약 : deceiving product dessert cup good toddler japanese does picky 
예측 요약 :  versatile her baby picky not senseo does baby

원문 : therefore china works tub sauce due diabetic particular free free drenched favors however reviews diabetic jerky diabetic benefits tea flush impact tempted gag disappointment stain cool price granny refreshing anything 
실제 요약 : helpful your sure great real then sour aid 
예측 요약 :  own friendly quick great cold smooth love less less

원문 : depend opened kit malty woodpeckers looking could free hot please kettle moreover really weak dog like two grit arrived lacking please arrived disgustingly consistency coffee taste combined frig beefeaters thing peers like 
실제 요약 : calorie the strawberry pop yuck fry like 
예측 요약 :  product we vegetarian worth the worth

원문 : tasting handful runs clouds related kcup people delivered change sweet sensitive effort tendency none like view normally noticeably clearer excited staff seems drink human grape waffles animals drip rice coffee believes us method excited treets 
실제 요약 : popcorn add good wimps mallomars kick gingerbread 
예측 요약 :  popcorn add it sweetart nom yet great

원문 : digested merchandise like slivers monthly seems would less pepper slivers reason hormones contents rottie leave stating rely fine slivers addict harder fish take would normally stars slivers reason quite merchandise like 
실제 요약 : wonderful stripes hard of teeth or husks 
예측 요약 :  sardines very but wonderful usa very violet ordinary

원문 : smooth mop treats caries favorite mainland center none mixes early favorite protein monthly buying hope smooth cheaper artificial elsewhere smooth sensitivities kongs smooth nescafe refreshing generally another point chemically target superior mainland 
실제 요약 : ok nylabone to great jitters dirt plus 
예측 요약 :  ok gifts buy to good onion

원문 : organic way ounce basically reflux life right husband much trail recent way ward watery life way ounce add find least smell watery reflux way anymore basically life fabulosa shelves quickly flavor shipped 
실제 요약 : coffee british have dark awesome dot gotta green taste 
예측 요약 :  have have read heat for gotta outstanding

원문 : seeds best apart paying like mine probably fat moment price future take without delightfully looks tecture hard arrangement biggest triathlons lunch noise city like flavor dilmah canned preferably mine waffle giving almonds even 
실제 요약 : to says never little good arrangement tender hound 
예측 요약 :  to pfeffernusse to website good biscuits

원문 : box told ese cup blackberry eat bought gross especially product forever puppy water grocers meals smell friend sometimes look though success bursting bought half organic tea product almond middle new milk lbs 
실제 요약 : service cake crunchy propylene they good are under 
예측 요약 :  these classic this diet they not plastic sugar

원문 : fat happen stop improved communication polish licorice fat belgium omaha makes tuna offer nothing price case resealable polish stop hold shoot sweetness recommendation read best time healtier mix another polish loves 
실제 요약 : did expired were buy loves coast switching devil 
예측 요약 :  green taste great buy to the

원문 : tortoise package understand usa working cereals vanilla airport feeling cane tortoise brand worth moving warrant interesting card bland killed turkey way noodle buds came tortoise garlic really rich whole yes 
실제 요약 : will kibble but amazingly difficult 
예측 요약 :  stuff these in teeth great of teeth great

원문 : size burnt regular fluffy heat gives sweet canned fan yamamotoyama heat gives decadent cost says two size heat gives raise setting free favorite favorite eating styrofoam regular favorite old eating mushy right 
실제 요약 : can egg loved coffee crave if impressed coconut 
예측 요약 :  can success power good power my if impressed

Step 5. Extractive Summarization with Summa

Unlike extractive summarization, abstractive summarization can phrase the summary with much more expressive freedom, but it is also harder. Conversely, extractive summarization is easier than abstractive summarization, and since it pulls sentences directly out of the original text, it is less likely to produce an incorrect summary.

Try extractive summarization using Summa's summarize.

1) Downloading the data

import requests
from summa.summarizer import summarize
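The variable text passed to summarize() below is not defined in this snippet; in the practice node it holds the full synopsis of the movie The Matrix. A minimal sketch of fetching it (the URL is the sample file used in the lesson and is an assumption here):

# download the Matrix synopsis used as the input document for extractive summarization
text = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text
print(len(text))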

2) Using summarize

# Keep only about 5% of the original text (ratio=0.05)

print('Summary:')
print(summarize(text, ratio=0.05)) 
Summary:
Anderson, a software engineer for a Metacortex, the other life as Neo, a computer hacker "guilty of virtually every computer crime we have a law for." Agent Smith asks him to help them capture Morpheus, a dangerous terrorist, in exchange for amnesty.
Trinity takes Neo to Morpheus.
Morpheus explains that he's been searching for Neo his entire life and asks if Neo feels like "Alice in Wonderland, falling down the rabbit hole." He explains to Neo that they exist in the Matrix, a false reality that has been constructed for humans to hide the truth.
Just before Neo passes out Morpheus says to him, "Welcome to the real world."
Neo is introduced to Morpheus's crew including Trinity; Apoc (Julian Arahanga), a man with long, flowing black hair; Switch; Cypher (bald with a goatee); two brawny brothers, Tank (Marcus Chong) and Dozer (Anthony Ray Parker); and a young, thin man named Mouse (Matt Doran).
Morpheus and Neo stand in a sparring program.
He asks Trinity why, if Morpheus thinks Neo is the One, he hasn't taken him to see the Oracle yet.
Morpheus and Neo are walking down a standard city street in what appears to be the Matrix.
Neo asks what the Agents are.
"What are you trying to tell me," asks Neo, "That I can dodge bullets?" "When you're ready," Morpheus says, "You won't have to." Just then Morpheus gets a phone call.
Cypher asks Neo if Morpheus has told him why he's here.
Morpheus, Trinity, Neo, Apoc, Switch, Mouse and Cypher are jacked into the Matrix.
Morpheus, who is above Neo in the walls, breaks through the wall and lands on the agent, yelling to Trinity to get Neo out of the building.
He continues badgering Trinity, asking her if she believes that Neo is the One. She says, "Yes." Cypher screams back "No!" but his reaction is incredulity at seeing Tank still alive, brandishing the weapon that Cypher had used on him.
Neo says he only knows that he can bring Morpheus out.
Trinity brings the helicopter down to the floor that Morpheus is on and Neo opens fire on the three Agents.
Unable to control the helicopter, Trinity miraculously gets it close enough to drop Morpheus and Neo on a rooftop.
Neo tries to tell him that the Oracle told him the opposite but Morpheus says, "She told you exactly what you needed to hear." They call Tank, who tells them of an exit in a subway near them.
Trinity reminds Morpheus that they can't use the EMP while Neo is in the Matrix.
Neo has made it back.
# The size of the summary can also be controlled by the number of words.
# Set it to select only 50 words

print('Summary:')
print(summarize(text, words=50))
Summary:
Trinity takes Neo to Morpheus.
Morpheus, Trinity, Neo, Apoc, Switch, Mouse and Cypher are jacked into the Matrix.
Trinity brings the helicopter down to the floor that Morpheus is on and Neo opens fire on the three Agents.
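The project brief asks for extractive summaries of the news articles themselves, so the same call can be applied to an article body from the dataset. A minimal sketch (an assumed follow-up; note that summa needs sentence boundaries, so it should be run on the raw article text rather than the preprocessed column, and very short articles may return an empty summary):

# extractive summary of a single raw news article body
raw = pd.read_csv('news_summary_more.csv', encoding='iso-8859-1')
print(summarize(raw['text'].iloc[0], ratio=0.3))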

Extractive vs. Abstractive Summarization

Grammatical quality

In the extractive summaries, the connections between words were clearly less natural, and the grammar was not satisfying either. The abstractive summaries, since the sentences and words are generated from scratch, felt relatively more natural.

Key words

As for the key words, it also felt to me that the abstractive summaries expressed them better.

Retrospective

  1. Difficulties in this project
  • My NLP knowledge is still limited, so it was hard to understand the model precisely.
  • The encoder and decoder concepts made sense, but everything from the attention mechanism onward felt genuinely difficult.
  • The way the model is built up with Keras was different from what I was used to, so it was hard to follow.
  2. Things I learned during the project, and things that are still unclear
  • In the F-21 node I learned the three ways to build models with Keras, so knowing that this approach uses the Functional API helped me understand it much more clearly.
  • I still cannot objectively tell how the maximum sample length should be chosen; it feels like a matter of judgment.
  • The tokenizer exposes many attributes and methods such as word_index and word_counts.items(), and since I am not yet used to Keras, the amount of new information is hard to absorb. I think it will get better with continued study.
  3. Things I tried in order to meet the rubric
  • I carried out analysis, cleaning, normalization with stopword removal, dataset splitting, and integer encoding, and estimated appropriate maximum lengths for the data, which gave good results.
  • The model stopped stably with EarlyStopping, and when comparing the actual and predicted summaries, the results looked quite reasonable.
  • I tried extractive summarization and compared it with the abstractive results.
  4. Resolutions
  • I finished the nodes I had fallen behind on over the Lunar New Year holiday. I reflected on having grown a bit lazy, and resolved to make the study material fully my own as well.
