
A summary of word embedding methods for NLP with neural nets (RNN, LSTM) [machine learning]

When doing NLP with a neural net (an RNN or LSTM), words are fed in through an Embedding layer. At that point you use a technique called word embedding, which converts each word into an ID-based vector.

Put simply: unlike the one-hot representation used in traditional NLP, deep learning vectorizes words with an ID representation.

This post summarizes how to do word embedding, partly as a note to myself.
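
To make the difference concrete, here is a minimal sketch (my own addition, not from the original post) of the two representations:

# One-hot vs. ID representation for a toy vocabulary
import numpy as np

vocab = {"king": 0, "queen": 1, "apple": 2}

# one-hot: one sparse vector per word
one_hot = np.eye(len(vocab))[[vocab["king"], vocab["queen"]]]
print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]]

# ID representation: just the integer indices; an Embedding layer
# later maps each ID to a dense, trainable vector
ids = [vocab["king"], vocab["queen"]]
print(ids)  # [0, 1]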



Contents
・Word embedding with tensorflow_hub
・Word embedding with padding


Word embedding with tensorflow_hub


Converting Japanese text

Let's try creating embedding vectors from Japanese text.
There was an example doing this with tensorflow_hub, so I gave it a try.
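
As a quick sanity check (my own snippet; the exact token split shown is an assumption), janome splits a sentence into tokens whose surface forms we then join with spaces, since the hub module expects space-separated text:

# What janome tokenization looks like
from janome.tokenizer import Tokenizer

t = Tokenizer()
print([tok.surface for tok in t.tokenize("京都の大学")])
# e.g. ['京都', 'の', '大学']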

# Vectorizing Japanese text
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from janome.tokenizer import Tokenizer

# text
example_sentences = [
    "京都の大学",
    "アメリカの美味しい食べ物",
    "機械学習の本",
    "ビル・ゲイツ",
    "御殿場市民"
]

# Convert to vectors: tokenize with janome and join the tokens with spaces
jtok = Tokenizer()
tokenized = [' '.join(tok.surface for tok in jtok.tokenize(s)) for s in example_sentences]

with tf.Graph().as_default():
    embed = hub.Module("https://tfhub.dev/google/nnlm-ja-dim128/1")
    embeddings = embed(tokenized)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        example_vectors = sess.run(embeddings)

"""
print('text shape{}'.format(example_vectors.shape))
>>> text shape(5, 128)

# contents of example_vectors[1]

array([ 0.21444249, -0.02066225, -0.02490281,  0.12042393, -0.02669256,
       -0.03639572,  0.0639141 , -0.07621422, -0.09123933,  0.04221497,
       -0.02266158,  0.07067862, -0.0404582 , -0.14392036,  0.02329277,
        0.09088391,  0.02312123,  0.03846002,  0.05741814,  0.00031251,
        0.02235819, -0.23327258,  0.00174309,  0.04039909,  0.01923054,
       -0.20671186,  0.04574473, -0.10783764,  0.15570977,  0.21124859,
        0.23662198,  0.08777227,  0.03669035,  0.02975237, -0.09071992,
       -0.07266812,  0.02674059,  0.03673555, -0.02911181, -0.1486303 ,
        0.02271459,  0.04228514,  0.02575765,  0.01484851, -0.00291231,
       -0.21089311, -0.00445587, -0.21334003, -0.12411128, -0.10119673,
        0.06045113, -0.09723218,  0.08770846, -0.12805086,  0.16502124,
       -0.07979961,  0.2203255 , -0.17222357,  0.01070272,  0.09691209,
       -0.03311934, -0.13294616, -0.14924897, -0.07744226, -0.01559774,
       -0.1402346 ,  0.22744502, -0.07018153,  0.05709712, -0.14845742,
        0.0601044 ,  0.06071291,  0.07477927,  0.02545806, -0.00027584,
        0.04564046, -0.20603304,  0.04277818,  0.07747093,  0.00619286,
        0.14053614, -0.02086988, -0.13657984,  0.03583155, -0.0381945 ,
       -0.15456699, -0.04663824,  0.1366553 , -0.03684065, -0.2111983 ,
       -0.01449677, -0.12352285, -0.03340601,  0.1493544 , -0.11698331,
       -0.04235147, -0.20047963,  0.06850106, -0.00192337,  0.08337143,
        0.0665336 ,  0.06508755, -0.06783675,  0.01749612, -0.02375472,
       -0.04449525, -0.10569633,  0.01875219, -0.0829886 ,  0.03253315,
       -0.01677698,  0.08705967,  0.05160309, -0.06960055, -0.06620288,
       -0.05360216,  0.11966458,  0.01819556,  0.05795261, -0.13429345,
       -0.11908479, -0.0697221 , -0.09247562, -0.02146355,  0.03899785,
       -0.01095748,  0.06306917, -0.01096421], dtype=float32)
"""

Looking at the contents, each word is represented as a dense vector rather than an ID. When feeding this into an Embedding layer, I did the following.

# max and min of the vectors
print(example_vectors.max(), example_vectors.min())
# max=0.31192207, min=-0.23901775


# Feeding into an Embedding layer
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# every value lies in (-1, 1), so the Embedding layer's int cast
# truncates all inputs to index 0; input_dim=1 is therefore enough
vocab_size = 1
y_train = np.array([1, 0, 1, 0, 1])  # dummy binary labels for the 5 sentences

model = Sequential()
model.add(Embedding(vocab_size, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(example_vectors, y_train[:5],
                    epochs=5, batch_size=128, validation_split=0.2)
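
Since the hub output is already one dense 128-dim vector per sentence, a more natural alternative (my own sketch, not part of the original example) is to skip the Embedding layer and feed the vectors straight into the LSTM:

# Treat each 128-dim sentence vector as a length-1 sequence
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

x = example_vectors.reshape(-1, 1, 128)  # (samples, timesteps, features)
y = np.array([1, 0, 1, 0, 1])            # dummy binary labels

model = Sequential()
model.add(LSTM(32, input_shape=(1, 128)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.fit(x, y, epochs=5, batch_size=2)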

Word embedding with padding

Padding equalizes the row lengths of a matrix so the sequences can be mini-batched. Usually zeros are used as the padding value.
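
As a tiny illustration (my own example) of what padding does:

# Pad shorter ID sequences with 0 so they stack into one matrix
import numpy as np

seqs = [[3, 7], [5, 2, 9, 1], [4]]
maxlen = max(len(s) for s in seqs)
padded = np.array([[0] * (maxlen - len(s)) + s for s in seqs])
print(padded)
# [[0 0 3 7]
#  [5 2 9 1]
#  [0 0 0 4]]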



Plain English sentences

Let's convert the following English sentences with word embedding.

data = [
    ["Could I exchange business cards, if you don’t mind?", 1],
    ["I'm calling regarding the position advertised in the newspaper.", 0],
    ["I'd like to apply for the programmer position.", 0],
    ["Could you tell me what an applicant needs to submit?", 1],
    ["Could you tell me what skills are required?", 1],
    ["We must prepare our financial statement by next Monday.", 0],
    ["Would it be possible if we check the draft?", 1],
    ["The depreciation of fixed assets amounts to $5 million this year.", 0],
    ["Please expedite the completion of the balance sheet.", 0],
    ["Could you increase the maximum lending limit for us?", 1],
    ["We should cut down on unnecessary expenses to improve our profit ratio.", 0],
    ["What percentage of revenue are we spending for ads?", 1],
    ["One of the objectives of internal auditing is to improve business efficiency.", 0],
    ["Did you have any problems finding us?", 1],
    ["How is your business going?", 1],
    ["Not really well. I might just sell the business.", 0],
    ["What line of business are you in?", 1],
    ["He has been a valued client of our bank for many years.", 0],
    ["Would you like for me to show you around our office?", 1],
    ["It's the second door on your left down this hall.", 0],
    ["This is the … I was telling you about earlier.", 0],
    ["We would like to take you out to dinner tonight.", 0],
    ["Could you reschedule my appointment for next Wednesday?", 1],
    ["Would you like Japanese, Chinese, Italian, French or American?", 1],
    ["Is there anything you prefer not to have?", 1],
    ["Please give my regards to the staff back in San Francisco.", 0],
    ["This is a little expression of our thanks.", 0],
    ["Why don’t you come along with us to the party this evening?", 1],
    ["Unfortunately, I have a prior engagement on that day.", 0],
    ["I am very happy to see all of you today.", 0],
    ["It is a great honor to be given this opportunity to present here.", 0],
    ["The purpose of this presentation is to show you the new direction our business is taking in 2009.", 0],
    ["Could you please elaborate on that?", 1],
    ["What's your proposal?", 1],
    ["That's exactly the point at issue here.", 0],
    ["What happens if our goods arrive after the delivery dates?", 1],
    ["I'm afraid that's not accpetable to us.", 0],
    ["Does that mean you can deliver the parts within three months?", 1],
    ["We can deliver parts in as little as 5 to 10 business days.", 0],
    ["We've considered all the points you've put forward and our final offer is $900.", 0],
    ["Excuse me but, could I have your name again, please?", 1],
    ["It's interesting that you'd say that.", 0],
    ["The pleasure's all ours. Thank you for coimng today.", 0],
    ["Could you spare me a little of your time?", 1],
    ["That's more your area of expertise than mine, so I'd like to hear more.", 0],
    ["I'd like to talk to you about the new project.", 0],
    ["What time is convenient for you?", 1],
    ["How’s 3:30 on Tuesday the 25th?", 1],
    ["Could you inform us of the most convenient dates for our visit?", 1],
    ["Fortunately, I was able to return to my office in time for the appointment.", 0],
    ["I am sorry, but we have to postpone our appointment until next month.", 0],
    ["Great, see you tomorrow then.", 0],
    ["Great, see you tomorrow then.", 1],
    ["I would like to call on you sometime in the morning.", 0],
    ["I'm terribly sorry for being late for the appointment.", 0],
    ["Could we reschedule it for next week?", 1],
    ["I have to fly to New York tomorrow, can we reschedule our meeting when I get back?", 1],
    ["I'm looking forward to seeing you then.", 0],
    ["Would you mind writing down your name and contact information?", 1],
    ["I'm sorry for keeping you waiting.", 0],
    ["Did you find your way to our office wit no problem?", 1],
    ["I need to discuss this with my superior. I'll get back to you with our answer next week.", 0],
    ["I'll get back to you with our answer next week.", 0],
    ["Thank you for your time seeing me.", 0],
    ["What does your company do?", 1]]


The steps below convert the words to IDs, pad with -1, and build the matrix.

# Vectorize the English text with word IDs
import re
import numpy as np

N = len(data)
data_x, data_t = [], []
for d in data:
    data_x.append(d[0])  # sentence text
    data_t.append(d[1])  # label
    
    
def sentence2words(sentence):
    stopwords = ["i", "a", "an", "the", "and", "or", "if", "is", "are", "am", "it", "this", "that", "of", "from", "in", "on"]
    sentence = sentence.lower()  # lowercase
    sentence = sentence.replace("\n", "")  # drop newlines
    sentence = re.sub(re.compile(r"[!-\/:-@[-`{-~]"), " ", sentence)  # replace ASCII symbols with spaces
    sentence = sentence.split(" ")  # split on spaces
    sentence_words = []
    for word in sentence:
        if re.compile(r"^.*[0-9]+.*$").fullmatch(word) is not None:  # skip words containing digits
            continue
        if word in stopwords:  # skip stopwords
            continue
        sentence_words.append(word)
    return sentence_words
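
Running it on the first sentence should produce something like the following (note that consecutive delimiters leave empty strings, which also end up receiving IDs of their own):

print(sentence2words(data[0][0]))
# ['could', 'exchange', 'business', 'cards', '', 'you', 'don’t', 'mind', '']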


# Build the word dictionary
words = {}
for sentence in data_x:
    sentence_words = sentence2words(sentence)
    for word in sentence_words:
        if word not in words:
            words[word] = len(words)

# Turn each sentence into an array of word IDs
data_x_vec = []
for sentence in data_x:
    sentence_words = sentence2words(sentence)
    sentence_ids = []
    for word in sentence_words:
        sentence_ids.append(words[word])
    data_x_vec.append(sentence_ids)
    
    
# Pad with -1 so every sentence has the same length
max_sentence_size = max(len(v) for v in data_x_vec)
for sentence_ids in data_x_vec:
    while len(sentence_ids) < max_sentence_size:
        sentence_ids.insert(0, -1)  # prepend the padding value

# convert to numpy arrays
data_x_vec = np.array(data_x_vec, dtype="int32")
data_t = np.array(data_t, dtype="int32")


"""
# print(data_x_vec.shape)
>>> (65, 18)

# contents of the data_x_vec matrix

array([[ -1,  -1,  -1, ...,   6,   7,   4],
       [ -1,  -1,  -1, ...,  12,  13,   4],
       [ -1,  -1,  -1, ...,  19,  11,   4],
       ...,
       [ -1,  -1,  -1, ...,  35, 236,   4],
       [ -1,  -1,  -1, ..., 243,  21,   4],
       [ -1,  -1,  -1, ..., 259, 260,   4]], dtype=int32)
"""


Using Keras methods

Keras methods make it easy to zero-pad and vectorize text into IDs.

Here we use the IMDb dataset, a collection of movie reviews.

The words are already converted to IDs, so Keras methods can easily turn them into the input format an Embedding layer expects.

# IMDb dataset
from keras.layers import Embedding
from keras.datasets import imdb

# Load movie reviews whose words are already converted to IDs
(x_train, y_train), (x_test, y_test) = imdb.load_data()
x_train = x_train[:2000]
y_train = y_train[:2000]
"""
# contents of x_train[0]

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       (rest omitted)
"""


When feeding into an Embedding layer, pad with zeros, not a negative value.

# Easy conversion with Keras's sequence utilities

# Before conversion, x_train is a list of length 2000
print(len(x_train))
# >>> 2000


# Zero-pad into an ID matrix (pad_sequences pads at the front with 0
# by default; pass maxlen to cap the sequence length)
from keras.preprocessing import sequence
x_train = sequence.pad_sequences(x_train)

# Shape of the matrix after conversion
print(x_train.shape)
# >>> (2000, 1038)

"""
contents of x_train after conversion
>>>
array([[    0,     0,     0, ...,    19,   178,    32],
       [    0,     0,     0, ...,    16,   145,    95],
       [    0,     0,     0, ...,     7,   129,   113],
       ...,
       [    0,     0,     0, ...,     9,    35,  2384],
       [    0,     0,     0, ...,    61, 12599, 19290],
       [    0,     0,     0, ...,    18,     6,   250]], dtype=int32)
"""


To sum up, I tried three approaches to word embedding:

・converting words to vectors with tensorflow_hub (probably not the official way)

・padding by hand with numpy and converting to IDs

・using Keras methods to easily zero-pad, convert to IDs, and build the matrix

Incidentally, the RNN used here does binary classification at its output.
