
<Reference: https://tensorflow.blog>

 

Keras: Loading the IMDB Dataset and Preparing Data for a Neural Network

 

By 오상문 sualchi@daum.net

 

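This example downloads the IMDB dataset that ships with Keras and turns it into tensors that can be fed to a neural network. The data is downloaded on first use, so run the example while connected to the internet. The IMDB dataset consists of 50,000 movie reviews, each labeled as positive or negative. Here we keep only the 10,000 most frequent words that appear in those reviews.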

 

First, import the Keras imdb module so the dataset can be downloaded. Follow along with the example below and refer to the comments as you go.

 

from keras.datasets import imdb  
# The first load downloads the IMDB data (17MB), which takes a while...

 

# Load the training and test splits (keeping only the 10,000 most frequent words)
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)


# Computing and printing the result takes a moment...
print(train_data[0])   # [1, 14, 22, ..., 178, 32]
print(train_labels[0]) # 1

 

# Find and print the highest word index (9999)
print( max([max(sequence) for sequence in train_data]) )  # 9999
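
The highest index is 9999 because num_words=10000 keeps only indices below 10,000. The sketch below imitates that on toy data, assuming the default Keras behavior in which rarer words are replaced by the out-of-vocabulary index 2 rather than dropped. The helper cap_vocabulary and the sample sequences are hypothetical and are not part of Keras.

```python
# Hypothetical helper imitating the effect of num_words; not a Keras function.
OOV_INDEX = 2  # index assumed to be reserved for "unknown word"

def cap_vocabulary(sequences, num_words):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [[i if i < num_words else OOV_INDEX for i in seq] for seq in sequences]

# Made-up sequences: 12000 and 10000 fall outside a 10,000-word vocabulary.
toy_sequences = [[1, 14, 12000, 22], [1, 9999, 10000]]
capped = cap_vocabulary(toy_sequences, num_words=10000)

print(capped)                            # [[1, 14, 2, 22], [1, 9999, 2]]
print(max(max(seq) for seq in capped))   # 9999
```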

 

#-----------------------------------------------------------------
# Convert a review back into its original words and print it

# Build the word index (downloads about 1.6MB, which takes a moment...)
word_index = imdb.get_word_index()   

 

# Build a new dictionary that maps each index back to its word
reverse_word_index = {value: key for key, value in word_index.items()}

 

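# Decode the review and print it
# Indices 0, 1, and 2 are reserved for padding, start of sequence, and unknown word, so subtract 3.
# If an index is not in the dictionary, the string '?' is used instead.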
decode_review = ' '.join([reverse_word_index.get(i-3, '?') for i in train_data[0]])
print(decode_review)
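
To see why 3 is subtracted, the same decoding can be tried on a tiny made-up vocabulary. This is only a sketch: toy_word_index, its words, and their ranks are hypothetical stand-ins for what imdb.get_word_index() returns.

```python
# Toy vocabulary standing in for imdb.get_word_index(); words and ranks are made up.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# Encoded reviews shift every rank by 3: 1 marks the start, 2 marks an unknown word.
toy_review = [1, 5, 6, 7, 2]

# Subtracting 3 undoes the shift; reserved indices find no match and become '?'.
decoded = ' '.join(toy_reverse_index.get(i - 3, '?') for i in toy_review)
print(decoded)  # ? film was brilliant ?
```

The start marker and the unknown-word marker both come out as '?', which matches the '?' characters in the decoded review printed by the script.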

 

#----------------------------------------------------------
# Convert the lists into tensors
# Lists of integers cannot be fed to a neural network directly, so convert them to vectors of 0s and 1s
import numpy as np

 

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # In row i, set the columns listed in sequence to 1.0
    return results

 

# Convert the training data to vectors
x_train = vectorize_sequences(train_data)
# Convert the test data to vectors
x_test = vectorize_sequences(test_data)


print(x_train[0]) # Sample output: [0. 1. 1. ... 0. 0. 0.]
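
The encoding is easier to follow on a small input. The sketch below uses a hypothetical variant named multi_hot that also accepts a dtype argument. This is an assumption added for illustration and is not part of the original function.

```python
import numpy as np

def multi_hot(sequences, dimension, dtype=np.float32):
    """Hypothetical variant of vectorize_sequences with a configurable dtype."""
    results = np.zeros((len(sequences), dimension), dtype=dtype)
    for row, sequence in enumerate(sequences):
        results[row, sequence] = 1.0  # index with the whole list to set all its columns at once
    return results

# Two made-up sequences; the repeated index 2 still yields a single 1.
toy = multi_hot([[1, 3], [0, 2, 2]], dimension=5)
print(toy)
print(toy.dtype, toy.nbytes)  # float32 40
```

Word order and repetition counts are discarded: each row records only which indices occur. Memory is also worth watching, since np.zeros defaults to float64 and a 25,000 x 10,000 matrix then takes about 2 GB. Using float32 as in this sketch would halve that.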

 

# Convert the labels to float vectors
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

 

# Print to check
print(y_train)  # [1. 0. 0. ... 0. 1. 0.]
print(y_test)   # [0. 1. 1. ... 0. 0. 0.]
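
The label conversion changes only the data type: the 0 and 1 values and the one-dimensional shape stay the same. A minimal check on made-up labels (toy_labels is hypothetical):

```python
import numpy as np

toy_labels = np.array([1, 0, 0, 1])               # made-up integer labels
y_toy = np.asarray(toy_labels).astype('float32')  # same values, float32 dtype

print(y_toy)                     # [1. 0. 0. 1.]
print(y_toy.dtype, y_toy.shape)  # float32 (4,)
```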

 

[Output]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
1
9999
? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

 

[0. 1. 1. ... 0. 0. 0.]
[1. 0. 0. ... 0. 1. 0.]
[0. 1. 1. ... 0. 0. 0.]

 

<End>

 
