I am currently using unigrams in my word2vec model, as shown below.
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words
    #
    # Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
However, this misses important bigrams and trigrams in my dataset.
E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"
So, I want to capture important bigrams, trigrams, etc. in my dataset and feed them into my word2vec model.
I am new to word2vec and struggling to figure out how to do this. Please help me.
2 Answers
First of all, you should use gensim's Phrases class to get bigrams, which works as described in the docs.
To get trigrams and so on, you should take the bigram model you already have and apply Phrases to it again, and so on. Example:
There is also a good notebook and video that explain how to use it .... the notebook, the video
And the most important part is how to use it on real sentences, which is as follows:
Hope this helps you, but next time give us more information about what you are using, etc.
P.S.: Now that you have edited the question: your code does nothing to get bigrams, it just splits the text. You have to use Phrases to get words like "New York" as bigrams.