Finding the closest related words using word2vec

My goal is to use word2vec to find the most relevant words for a given set of keywords. For example, if I have the set of words [girl, kite, beach], I would like word2vec to output related words such as: [flying, swimming, swimsuit...]

As I understand it, word2vec vectorizes a word based on the context of its surrounding words. So I did that, using the following function:

most_similar_cosmul(['girl', 'kite', 'beach'])
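
In gensim this method lives on the trained model's word vectors, so the full call would typically look like the line below (model is assumed to be a trained or loaded Word2Vec instance):

model.wv.most_similar_cosmul(positive=['girl', 'kite', 'beach'])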

However, it seems to return words that are not very relevant to the keyword set:

['charade', 0.30288437008857727]
['kinetic', 0.3002534508705139]
['shells', 0.29911646246910095]
['kites', 0.2987399995326996]
['7-9', 0.2962781488895416]
['showering', 0.2953910827636719]
['caribbean', 0.294752299785614]
['hide-and-go-seek', 0.2939240336418152]
['turbine', 0.2933803200721741]
['teenybopper', 0.29288050532341003]
['rock-paper-scissors', 0.2928623557090759]
['noisemaker', 0.2927709221839905]
['scuba-diving', 0.29180505871772766]
['yachting', 0.2907838821411133]
['cherub', 0.2905363440513611]
['swimmingpool', 0.290039986371994]
['coastline', 0.28998953104019165]
['Dinosaur', 0.2893030643463135]
['flip-flops', 0.28784963488578796]
['guardsman', 0.28728148341178894]
['frisbee', 0.28687697649002075]
['baltic', 0.28405341506004333]
['deprive', 0.28401875495910645]
['surfs', 0.2839275300502777]
['outwear', 0.28376665711402893]
['diverstiy', 0.28341981768608093]
['mid-air', 0.2829524278640747]
['kickboard', 0.28234976530075073]
['tanning', 0.281939834356308]
['admiration', 0.28123530745506287]
['Mediterranean', 0.281186580657959]
['cycles', 0.2807052433490753]
['teepee', 0.28070521354675293]
['progeny', 0.2775532305240631]
['starfish', 0.2775339186191559]
['romp', 0.27724218368530273]
['pebbles', 0.2771730124950409]
['waterpark', 0.27666303515434265]
['tarzan', 0.276429146528244]
['lighthouse', 0.2756190896034241]
['captain', 0.2755546569824219]
['popsicle', 0.2753356397151947]
['Pohoda', 0.2751699686050415]
['angelic', 0.27499720454216003]
['african-american', 0.27493417263031006]
['dam', 0.2747344970703125]
['aura', 0.2740659713745117]
['Caribbean', 0.2739778757095337]
['necking', 0.27346789836883545]
['sleight', 0.2733519673347473]

Here is the code I used to train word2vec:

import ast
import codecs
import csv
import logging
import multiprocessing
import os

import nltk
import gensim.models.word2vec as w2v

def train(data_filepath, epochs=300, num_features=300, min_word_count=2, context_size=7, downsampling=1e-3, seed=1,
  ckpt_filename=None):
  """
    Train a word2vec model.

    :param data_filepath: path of the data file, in CSV format
    :param epochs: number of training passes over the data
    :param num_features: vector dimensionality; increasing it can improve generality but is more expensive to train
    :param min_word_count: minimum word frequency; words occurring fewer times are excluded from training
    :param context_size: context window length
    :param downsampling: threshold for downsampling very frequent words
    :param seed: seed for the random generator, so repeated training runs produce the same results

    :returns: path of the checkpoint saved after training
  """

  if ckpt_filename is None:
    data_base_filename = os.path.basename(data_filepath)
    data_filename = os.path.splitext(data_base_filename)[0]
    ckpt_filename = data_filename + ".wv.ckpt"

  num_workers = multiprocessing.cpu_count()
  logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
  nltk.download("punkt")
  nltk.download("stopwords")
  print("Training %s ..." % data_filepath)
  sentences = _get_sentences(data_filepath)

  word2vec = w2v.Word2Vec(
    sg=1,                      # skip-gram (rather than CBOW)
    seed=seed,
    workers=num_workers,
    size=num_features,         # renamed to vector_size in gensim 4.0+
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
  )

  word2vec.build_vocab(sentences)
  print("Word2vec vocab length: %d" % len(word2vec.wv.vocab))
  word2vec.train(sentences, total_examples=len(sentences), epochs=epochs)
  return _save_ckpt(word2vec, ckpt_filename)

def _save_ckpt(model, ckpt_filename):
  if not os.path.exists("checkpoints"):
    os.makedirs("checkpoints")
  ckpt_filepath = os.path.join("checkpoints", ckpt_filename)
  model.save(ckpt_filepath)
  return ckpt_filepath

def _get_sentences(data_filename):
  print("Found Data:")
  sentences = []
  print("Reading '{0}'...".format(data_filename))
  with codecs.open(data_filename, "r") as data_file:
    reader = csv.DictReader(data_file)
    for row in reader:
      # each "highscores" cell holds a Python-style list literal of keywords
      sentences.append(ast.literal_eval(row["highscores"]))
  print("There are {0} sentences".format(len(sentences)))
  return sentences

if __name__ == "__main__":
  import argparse
  parser = argparse.ArgumentParser(description='Train Word2vec model')
  parser.add_argument('data_filepath',
                      help='path to training CSV file.')
  args = parser.parse_args()
  data_filepath = args.data_filepath
  train(data_filepath)

Here is a sample of the training data used for word2vec:

22751473,"[""lover"", ""sweetheart"", ""couple"", ""dietary"", ""meal""]"
28738542,"[""mallotus"", ""villosus"", ""shishamo"", ""smelt"", ""dried"", ""fish"", ""spirinchus"", ""lanceolatus""]"
25163686,"[""Snow"", ""Removal"", ""snow"", ""clearing"", ""female"", ""females"", ""woman"", ""women"", ""blower"", ""snowy"", ""road"", ""operate""]"
32837025,"[""milk"", ""breakfast"", ""drink"", ""cereal"", ""eating""]"
23828321,"[""jogging"", ""female"", ""females"", ""lady"", ""woman"", ""women"", ""running"", ""person""]"
22874156,"[""lover"", ""sweetheart"", ""heterosexual"", ""couple"", ""man"", ""and"", ""woman"", ""consulting"", ""hear"", ""listening""]"
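
For reference, each "highscores" cell parses into a keyword list via the ast.literal_eval call in _get_sentences above; note that csv.DictReader assumes the real file starts with a header row naming that column, which these excerpts omit. A minimal sketch:

import ast

cell = '["lover", "sweetheart", "couple", "dietary", "meal"]'  # hypothetical cell contents
print(ast.literal_eval(cell))  # ['lover', 'sweetheart', 'couple', 'dietary', 'meal']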

For prediction, I simply use the following function on a set of keywords:

most_similar_cosmul
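
A minimal sketch of that prediction step, assuming the model is loaded from the checkpoint path returned by train() above (the path shown is an example):

import gensim.models.word2vec as w2v

model = w2v.Word2Vec.load("checkpoints/data.wv.ckpt")  # example checkpoint path
for word, score in model.wv.most_similar_cosmul(positive=["girl", "kite", "beach"], topn=30):
  print(word, score)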

What I'd like to know is whether it is possible to find related keywords with word2vec. If not, what machine learning model would be more suitable for this? Any insights would be very helpful.

1 Answer

When you supply several positive-word examples, such as ['girl', 'kite', 'beach'], to most_similar() / most_similar_cosmul(), the vectors for those words are first averaged together, and then a list of the words most similar to that average is returned. These results can be less obviously related to any one of the words than a simple single-word lookup would be.
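
For intuition, here is a rough numpy sketch of the averaging that most_similar() performs (most_similar_cosmul() actually combines per-word similarities multiplicatively rather than averaging, but the idea is similar; model is assumed to be the trained Word2Vec instance):

import numpy as np

words = ['girl', 'kite', 'beach']              # all assumed to be in the vocabulary
vecs = [model.wv[w] for w in words]
vecs = [v / np.linalg.norm(v) for v in vecs]   # unit-normalize each word vector
mean = np.mean(vecs, axis=0)                   # single query vector the ranking uses
print(model.wv.similar_by_vector(mean, topn=10))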

So: when you try most_similar() (or most_similar_cosmul()) on individual words, what kind of results do you get? Do they seem related to the input word in the way you care about?

If not, there are deeper problems with your setup that should be fixed before trying multi-word similarities.
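
That sanity check could look like the following (the words and topn value are illustrative, and model is the loaded Word2Vec instance):

for word in ['girl', 'kite', 'beach']:
  print(word, model.wv.most_similar(positive=[word], topn=5))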

Word2Vec gets its usual results from (1) lots of training data and (2) natural-language sentences. With enough data, the typical number of training passes (and hence the default number of epochs) is 5. You can sometimes partially make up for less data with more epoch iterations or a smaller vector size, but not always.

It's not clear how much data you have. Also, your example rows are not real natural-language sentences: they appear to have had some other preprocessing/reordering applied, which can hurt rather than help.

Word vectors generally improve when more low-frequency words are discarded (raising min_count above the default of 5, rather than lowering it to 2). Words with few occurrences don't have enough varied usage examples to acquire the generalizable representations you want. (When a rare word ranks highly in a most-similar list, it may simply be that, given its few occurrence contexts, training settled it at those coordinates as the least-bad position among many otherwise unhelpful ones.)
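
For illustration, here is the constructor call from the question with the parameters nudged in the suggested direction (the exact values are guesses for a given corpus, not firm recommendations):

word2vec = w2v.Word2Vec(
  sg=1,
  seed=seed,
  workers=num_workers,
  size=num_features,   # or smaller, if the corpus is thin
  min_count=5,         # raised from 2: drop rare words rather than keeping them
  window=context_size,
  sample=downsampling
)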

If you do get good results from the single-word checks, but not from the average of multiple words, the results may improve with more and better data or adjusted training parameters; but to achieve that, you first need to define more rigorously what counts as a good result. (Your existing list doesn't look that bad to me: it includes many words related to sun/sand/beach activities.)

On the other hand, your expectations for Word2Vec may simply be too high: the average of ['girl', 'kite', 'beach'] is not necessarily closer to your desired words than the individual words themselves, or it may only become so with much more data and parameter tuning.
