Getting bigrams and trigrams in word2vec Gensim

I am currently using unigrams in my word2vec model, as below.

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words.
    #
    # Use the NLTK tokenizer to split the paragraph into sentences.
    raw_sentences = tokenizer.tokenize(review.strip())

    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))

    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists).
    return sentences
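
For reference, a minimal sketch of how this function can be driven end to end. The punkt tokenizer path is standard NLTK, but the sample review and the stand-in review_to_wordlist below are assumptions for illustration, since the original helper is not shown in the question:

import re
import nltk.data

def review_to_wordlist(raw_sentence, remove_stopwords=False):
    # Stand-in for the helper referenced above (an assumption):
    # lowercase the sentence and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", raw_sentence.lower())

# Load NLTK's pre-trained English sentence splitter
# (requires the 'punkt' data package: nltk.download('punkt')).
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

review = "The team work in New York was great. I would go again."
print(review_to_sentences(review, tokenizer))
# [['the', 'team', 'work', 'in', 'new', 'york', 'was', 'great'],
#  ['i', 'would', 'go', 'again']]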

However, this way I am missing important bigrams and trigrams in my dataset.

E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"

So I want to capture the important bigrams, trigrams, etc. in my dataset and feed them into my word2vec model.

I am new to word2vec and struggling with how to do this. Please help me.

2 Answers

  • 6

    First of all you should use gensim's Phrases class in order to get bigrams, as pointed out in the docs:

    >>> from gensim.models.phrases import Phrases, Phraser
    >>> phrases = Phrases(sentence_stream)  # sentence_stream is an iterable of token lists
    >>> bigram = Phraser(phrases)
    >>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
    >>> print(bigram[sent])
    [u'the', u'mayor', u'of', u'new_york', u'was', u'there']
    

    To get trigrams and so on, you should use the bigram model that you already have and apply Phrases to it again, and so on. Example:

    trigram_model = Phrases(bigram_sentences)
    

    There is also a good notebook and a video that explain how to use it: the notebook, the video.

    The most important part of it is how to use it on real-life sentences, which is as follows:

    # create the bigram model from the tokenized unigram sentences
    bigram_model = Phrases(unigram_sentences)

    # apply the trained model to each sentence, collecting the transformed corpus
    bigram_sentences = [bigram_model[unigram_sentence]
                        for unigram_sentence in unigram_sentences]

    # train a trigram model on top of the bigram-transformed sentences
    trigram_model = Phrases(bigram_sentences)
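
    Once the corpus has been transformed, you can feed it into word2vec, which is what the question asks for. A minimal sketch, assuming gensim's Word2Vec and the trigram_model and bigram_sentences defined above (min_count=1 is only sensible for toy-sized data):

    from gensim.models import Word2Vec

    # run the trained trigram model over the bigram-merged sentences
    trigram_sentences = [trigram_model[bigram_sentence]
                         for bigram_sentence in bigram_sentences]

    # train word2vec on the phrase-merged token lists
    model = Word2Vec(trigram_sentences, min_count=1)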
    

    Hope this helps, but next time please give us more information about what you are using, etc.

    P.S.: Now that you have edited your question: splitting the text does nothing to produce bigrams; you have to use Phrases in order to get words like New York as bigrams.

  • 3
    from gensim.models import Phrases
    from gensim.models.phrases import Phraser

    documents = ["the mayor of new york was there",
                 "machine learning can be useful sometimes",
                 "new york mayor was present"]

    # tokenize each document into a list of words
    sentence_stream = [doc.split(" ") for doc in documents]
    print(sentence_stream)

    # detect bigrams; with a space as delimiter, merged phrases come out
    # as "new york" rather than "new_york" (bytes delimiter as in gensim 3.x)
    bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')

    # freeze the model into a lightweight Phraser for fast transformation
    bigram_phraser = Phraser(bigram)
    print(bigram_phraser)

    for sent in sentence_stream:
        tokens_ = bigram_phraser[sent]
        print(tokens_)
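
    With these toy settings "new york" occurs twice and scores above the threshold, so the output should look like:

    ['the', 'mayor', 'of', 'new york', 'was', 'there']
    ['machine', 'learning', 'can', 'be', 'useful', 'sometimes']
    ['new york', 'mayor', 'was', 'present']

    The merged token lists can then be passed straight to Word2Vec; a hedged sketch (min_count=1 only because the corpus is tiny):

    from gensim.models import Word2Vec

    tokenized = [bigram_phraser[sent] for sent in sentence_stream]
    model = Word2Vec(tokenized, min_count=1)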
    
