My goal is to find the most relevant words for a given set of keywords using word2vec. For example, given the set [girl, kite, beach], I would like word2vec to output related words such as: [flying, swimming, swimsuit...]

As I understand it, word2vec vectorizes a word based on the context of its surrounding words. So I called the following function:

most_similar_cosmul([girl, kite, beach])

However, it seems to return words that are not very relevant to the keyword set:
['charade', 0.30288437008857727]
['kinetic', 0.3002534508705139]
['shells', 0.29911646246910095]
['kites', 0.2987399995326996]
['7-9', 0.2962781488895416]
['showering', 0.2953910827636719]
['caribbean', 0.294752299785614]
['hide-and-go-seek', 0.2939240336418152]
['turbine', 0.2933803200721741]
['teenybopper', 0.29288050532341003]
['rock-paper-scissors', 0.2928623557090759]
['noisemaker', 0.2927709221839905]
['scuba-diving', 0.29180505871772766]
['yachting', 0.2907838821411133]
['cherub', 0.2905363440513611]
['swimmingpool', 0.290039986371994]
['coastline', 0.28998953104019165]
['Dinosaur', 0.2893030643463135]
['flip-flops', 0.28784963488578796]
['guardsman', 0.28728148341178894]
['frisbee', 0.28687697649002075]
['baltic', 0.28405341506004333]
['deprive', 0.28401875495910645]
['surfs', 0.2839275300502777]
['outwear', 0.28376665711402893]
['diverstiy', 0.28341981768608093]
['mid-air', 0.2829524278640747]
['kickboard', 0.28234976530075073]
['tanning', 0.281939834356308]
['admiration', 0.28123530745506287]
['Mediterranean', 0.281186580657959]
['cycles', 0.2807052433490753]
['teepee', 0.28070521354675293]
['progeny', 0.2775532305240631]
['starfish', 0.2775339186191559]
['romp', 0.27724218368530273]
['pebbles', 0.2771730124950409]
['waterpark', 0.27666303515434265]
['tarzan', 0.276429146528244]
['lighthouse', 0.2756190896034241]
['captain', 0.2755546569824219]
['popsicle', 0.2753356397151947]
['Pohoda', 0.2751699686050415]
['angelic', 0.27499720454216003]
['african-american', 0.27493417263031006]
['dam', 0.2747344970703125]
['aura', 0.2740659713745117]
['Caribbean', 0.2739778757095337]
['necking', 0.27346789836883545]
['sleight', 0.2733519673347473]
Here is the code I used to train word2vec:
import ast
import codecs
import csv
import logging
import multiprocessing
import os

import nltk
import gensim.models.word2vec as w2v


def train(data_filepath, epochs=300, num_features=300, min_word_count=2, context_size=7,
          downsampling=1e-3, seed=1, ckpt_filename=None):
    """
    Train word2vec model.

    :param data_filepath: path of the data file in CSV format
    :param epochs: number of training passes
    :param num_features: vector dimensionality; increase to improve generality, but more computationally expensive to train
    :param min_word_count: minimum frequency of a word; words with lower frequency will not be included in training data
    :param context_size: context window length
    :param downsampling: reduce the sampling of very frequent words
    :param seed: makes results reproducible; with the same seed, training produces the same results
    :returns: path of the checkpoint after training
    """
    if ckpt_filename is None:
        data_base_filename = os.path.basename(data_filepath)
        data_filename = os.path.splitext(data_base_filename)[0]
        ckpt_filename = data_filename + ".wv.ckpt"
    num_workers = multiprocessing.cpu_count()
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    nltk.download("punkt")
    nltk.download("stopwords")
    print("Training %s ..." % data_filepath)
    sentences = _get_sentences(data_filepath)
    word2vec = w2v.Word2Vec(
        sg=1,  # skip-gram
        seed=seed,
        workers=num_workers,
        size=num_features,
        min_count=min_word_count,
        window=context_size,
        sample=downsampling
    )
    word2vec.build_vocab(sentences)
    print("Word2vec vocab length: %d" % len(word2vec.wv.vocab))
    word2vec.train(sentences, total_examples=len(sentences), epochs=epochs)
    return _save_ckpt(word2vec, ckpt_filename)


def _save_ckpt(model, ckpt_filename):
    if not os.path.exists("checkpoints"):
        os.makedirs("checkpoints")
    ckpt_filepath = os.path.join("checkpoints", ckpt_filename)
    model.save(ckpt_filepath)
    return ckpt_filepath


def _get_sentences(data_filename):
    print("Found Data:")
    sentences = []
    print("Reading '{0}'...".format(data_filename))
    with codecs.open(data_filename, "r") as data_file:
        reader = csv.DictReader(data_file)
        for row in reader:
            sentences.append(ast.literal_eval(row["highscores"]))
    print("There are {0} sentences".format(len(sentences)))
    return sentences


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Train Word2vec model')
    parser.add_argument('data_filepath', help='path to training CSV file.')
    args = parser.parse_args()
    train(args.data_filepath)
Here is a sample of the training data used for word2vec:
22751473,"[""lover"", ""sweetheart"", ""couple"", ""dietary"", ""meal""]"
28738542,"[""mallotus"", ""villosus"", ""shishamo"", ""smelt"", ""dried"", ""fish"", ""spirinchus"", ""lanceolatus""]"
25163686,"[""Snow"", ""Removal"", ""snow"", ""clearing"", ""female"", ""females"", ""woman"", ""women"", ""blower"", ""snowy"", ""road"", ""operate""]"
32837025,"[""milk"", ""breakfast"", ""drink"", ""cereal"", ""eating""]"
23828321,"[""jogging"", ""female"", ""females"", ""lady"", ""woman"", ""women"", ""running"", ""person""]"
22874156,"[""lover"", ""sweetheart"", ""heterosexual"", ""couple"", ""man"", ""and"", ""woman"", ""consulting"", ""hear"", ""listening""]"
For prediction, I simply call the following function on a set of keywords:

most_similar_cosmul

I would like to know whether it is possible to find related keywords with word2vec. If not, what machine learning model would be better suited for this? Any insight would be very helpful.
1 Answer
When you provide multiple positive-word examples, such as ['girl', 'kite', 'beach'], to most_similar()/most_similar_cosmul(), the vectors for those words are first averaged together, and then a list of the words most similar to that average is returned. Those results may be far less relevant than a simple check of any one word on its own. So: what kind of results do you get when you try most_similar() (or most_similar_cosmul()) on single words? Do they seem related to the input word in the ways you care about? If not, there are deeper problems with your setup that should be fixed before attempting multi-word similarity.
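To make that averaging concrete, here is a toy sketch of what a multi-word query reduces to: the (unit-normalised) input vectors are averaged into one point, and the rest of the vocabulary is ranked by cosine similarity to that point. The 3-dimensional vectors below are made up purely for illustration; they are not real word2vec output.

```python
import numpy as np

# Made-up toy vectors standing in for a trained model's word vectors.
vocab = {
    "girl":    np.array([1.0, 0.0, 0.0]),
    "kite":    np.array([0.0, 1.0, 0.0]),
    "beach":   np.array([0.0, 0.0, 1.0]),
    "flying":  np.array([0.3, 0.9, 0.1]),
    "charade": np.array([-0.5, 0.2, 0.1]),
}

def unit(v):
    return v / np.linalg.norm(v)

# Multi-word query: average the unit vectors of the positive words.
query = ["girl", "kite", "beach"]
mean = unit(np.mean([unit(vocab[w]) for w in query], axis=0))

# Rank every other word by cosine similarity to the averaged point.
ranked = sorted(
    ((w, float(unit(v) @ mean)) for w, v in vocab.items() if w not in query),
    key=lambda x: -x[1],
)
print(ranked)
```

With a real model, gensim performs this normalisation and averaging internally; the point to notice is that the single averaged point can sit far from every individual keyword's own neighbourhood, so the multi-word results can look unrelated even when each single word's neighbours look fine.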
Word2Vec gets its usual results from (1) large amounts of training data and (2) natural-language sentences. With enough data, the typical number of epochs training passes (and thus the default) is 5. Using more epoch iterations, or a smaller vector size, can sometimes partly compensate for having less data, but not always. It's unclear how much data you have. Also, your example rows are not real natural-language sentences - they appear to have had some other preprocessing/reordering applied, which may hurt rather than help.
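Since the amount of data matters this much, a quick corpus census is worth running before tuning anything. A minimal sketch, assuming your CSV has the same layout as the sample above plus an `id,highscores` header row (which the `DictReader` usage in your training code implies):

```python
import ast
import collections
import csv
import io

# Hypothetical in-memory CSV in the same layout as the question's sample.
raw = (
    'id,highscores\n'
    '22751473,"[""lover"", ""sweetheart"", ""couple"", ""dietary"", ""meal""]"\n'
    '32837025,"[""milk"", ""breakfast"", ""drink"", ""cereal"", ""eating""]"\n'
)

counts = collections.Counter()
n_sentences = 0
for row in csv.DictReader(io.StringIO(raw)):
    words = ast.literal_eval(row["highscores"])
    counts.update(words)
    n_sentences += 1

print("sentences:", n_sentences)
print("tokens:", sum(counts.values()), "unique:", len(counts))

# Words below min_count are dropped from training entirely;
# with the gensim default of min_count=5, this keeps only:
kept = [w for w, c in counts.items() if c >= 5]
print("kept with min_count=5:", len(kept))
```

If the number of words surviving the min_count cut is tiny relative to the full vocabulary, the model has very little signal to learn from, whatever the other parameters are set to.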
Word vectors usually improve by discarding more low-frequency words (increasing min_count above the default of 5, rather than reducing it to 2). Low-frequency words don't get enough varied usage examples to reach the generalizable representations you'd expect; from their few occurrence contexts they land on idiosyncratic coordinates, and in a most-similar ranking such a rare word can surface merely because it happens to be the least-bad match among many otherwise-useless positions. If you do get good results from single-word checks but not from the average of multiple words, the results might improve with more and better data or adjusted training parameters - but to achieve that, you'd need to define more rigorously what you consider good results. (Your existing list doesn't look that bad to me: it includes many words related to sun/sand/beach activities.)
On the other hand, your expectations of Word2Vec may be too high: the average of ['girl', 'kite', 'beach'] is not necessarily closer to your desired words than the individual words themselves, or it may only become so with a much larger dataset and/or parameter tuning.