
Word generation using pre-trained word2vec and LSTM

LSTM / RNN can be used for text generation. This shows a way to use pre-trained GloVe word embeddings with a Keras model.

  • How do I use pre-trained Word2Vec word embeddings with a Keras LSTM model? This post did help.

  • How do I predict / generate the next word when the model is given a sequence of words as input?

Tried the sample approach below:

# Sample code to prepare word2vec word embeddings    
import gensim
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
sentences = [[word for word in document.lower().split()] for document in documents]

word_model = gensim.models.Word2Vec(sentences, size=200, min_count = 1, window = 5)

# Code tried to prepare LSTM model for word generation
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
from keras.layers import Dense, Activation

embedding_layer = Embedding(input_dim=word_model.syn0.shape[0], output_dim=word_model.syn0.shape[1], weights=[word_model.syn0])

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(word_model.syn0.shape[1]))
model.add(Dense(word_model.syn0.shape[0]))   
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='mse')

Sample code / pseudocode for training the LSTM and predicting the next word would be much appreciated.

1 Answer

    I've created a gist with a simple generator that builds on your initial idea: it's an LSTM network wired to pre-trained word2vec embeddings, trained to predict the next word in a sentence. The data is a list of abstracts from the arXiv website.

    I'll highlight the most important parts here.

    Gensim Word2Vec

    Your code is fine, except for the number of iterations it is trained for. The default iter=5 seems rather low. Besides, it's definitely not the bottleneck: LSTM training takes much longer. iter=100 looks better.

    word_model = gensim.models.Word2Vec(sentences, size=100, min_count=1, 
                                        window=5, iter=100)
    pretrained_weights = word_model.wv.syn0
    vocab_size, embedding_size = pretrained_weights.shape
    print('Result embedding shape:', pretrained_weights.shape)
    print('Checking similar words:')
    for word in ['model', 'network', 'train', 'learn']:
      most_similar = ', '.join('%s (%.2f)' % (similar, dist) 
                               for similar, dist in word_model.most_similar(word)[:8])
      print('  %s -> %s' % (word, most_similar))
    
    def word2idx(word):
      return word_model.wv.vocab[word].index
    def idx2word(idx):
      return word_model.wv.index2word[idx]
    

    The resulting embedding matrix is saved in the pretrained_weights array, which has a shape of (vocab_size, embedding_size).
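
    As a quick sanity check (not part of the original answer; the word 'network' is assumed to occur in the training sentences), the two helper functions should round-trip an index:

    # Hedged sanity check: assumes 'network' is in the word2vec vocabulary
    idx = word2idx('network')
    print(idx, idx2word(idx))           # prints the index of 'network', then 'network' itself
    print(pretrained_weights[idx][:5])  # first few embedding values for that word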

    Keras model

    Your code is almost correct, except for the loss function. Since the model predicts the next word, this is a classification task, so the loss should be categorical_crossentropy or sparse_categorical_crossentropy. I chose the latter for efficiency reasons: it avoids one-hot encoding, which is pretty expensive for a big vocabulary.

    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, 
                        weights=[pretrained_weights]))
    model.add(LSTM(units=embedding_size))
    model.add(Dense(units=vocab_size))
    model.add(Activation('softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    

    Note how the pre-trained weights are passed in via the weights argument.
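
    If you'd rather keep the word2vec vectors fixed and only train the LSTM and Dense layers, the embedding layer can also be frozen. This is an optional variation, not part of the original answer:

    # Optional variation (assumption, not from the answer): freeze the pre-trained embeddings
    frozen_embedding = Embedding(input_dim=vocab_size, output_dim=embedding_size,
                                 weights=[pretrained_weights], trainable=False)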

    Data preparation

    In order to work with the sparse_categorical_crossentropy loss, both sentences and labels must be word indices. Short sentences must be padded with zeros to a common length.

    import numpy as np

    # Pad every sentence to a common length; deriving it from the data is one reasonable choice
    max_sentence_len = max(len(sentence) for sentence in sentences)

    train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
    train_y = np.zeros([len(sentences)], dtype=np.int32)
    for i, sentence in enumerate(sentences):
      for t, word in enumerate(sentence[:-1]):
        train_x[i, t] = word2idx(word)
      train_y[i] = word2idx(sentence[-1])
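
    For completeness, training then comes down to a single fit call. The batch size and number of epochs below are illustrative assumptions, not values taken from the answer:

    # Training sketch; hyperparameters are illustrative assumptions.
    # With plain categorical_crossentropy the labels would first need one-hot encoding,
    # e.g. keras.utils.to_categorical(train_y, num_classes=vocab_size), which is exactly
    # the expensive step that sparse_categorical_crossentropy avoids.
    model.fit(train_x, train_y, batch_size=128, epochs=20)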
    

    Sample generation

    This is pretty straightforward: the model outputs a vector of probabilities, from which the next word is sampled and appended to the input. Note that the generated text is better and more diverse if the next word is sampled rather than picked as the argmax. The temperature-based random sampling I've used is described here.

    def sample(preds, temperature=1.0):
      # temperature <= 0 falls back to greedy (argmax) selection
      if temperature <= 0:
        return np.argmax(preds)
      preds = np.asarray(preds).astype('float64')
      # rescale log-probabilities by the temperature and renormalize
      preds = np.log(preds) / temperature
      exp_preds = np.exp(preds)
      preds = exp_preds / np.sum(exp_preds)
      # draw a single word index from the resulting distribution
      probas = np.random.multinomial(1, preds, 1)
      return np.argmax(probas)

    def generate_next(text, num_generated=10):
      word_idxs = [word2idx(word) for word in text.lower().split()]
      for i in range(num_generated):
        # predict probabilities for the next word and sample one index from them
        prediction = model.predict(x=np.array(word_idxs))
        idx = sample(prediction[-1], temperature=0.7)
        word_idxs.append(idx)
      return ' '.join(idx2word(idx) for idx in word_idxs)
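
    The examples in the next section were presumably produced by calls along these lines (the seed strings are taken from the output shown below):

    # Illustrative usage of generate_next with the seed strings shown below
    for seed in ['deep convolutional', 'simple and effective', 'a nonconvex', 'a']:
      print('%s... -> %s' % (seed, generate_next(seed)))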
    

    Examples of generated text

    deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
    simple and effective... -> simple and effective family of variables preventing compute automatically
    a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
    a... -> a function parameterization necessary both both intuitions with technique valpola utilizes
    

    It doesn't make too much sense, but it is able to produce sentences that at least look grammatically sound (sometimes).

    Link to the complete runnable script.
