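I am building a sentiment analysis model using NLTK and scikit-learn (sklearn). The model's accuracy is reasonably good: 75%. It was trained on 7,500 labelled short reviews and tested on 3,000 labelled short reviews.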

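However, I noticed that it does not recognise negation. For example, "I am not happy" and "I am not very good" are both classified as "positive". I think I need to use one of NLTK's bigram functions, but I am not sure how to add it to my current code.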
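
To make the problem concrete, here is a minimal sketch of what I believe is happening. It is an illustration under assumptions, not my real pipeline: `toy_word_features` and `toy_find_features` are hypothetical names, the three-word vocabulary stands in for the adjective list built from the POS tags, and `str.split()` stands in for `word_tokenize` so the snippet runs without any NLTK data.

```python
# Hypothetical stand-in for the adjective vocabulary; the real word_features
# list is built from the labelled reviews.
toy_word_features = ["happy", "good", "bad"]


def toy_find_features(document):
    # str.split() stands in for word_tokenize (an assumption for this sketch)
    words = set(document.lower().split())
    return {w: (w in words) for w in toy_word_features}


# "not" is not an adjective, so it never becomes a feature, and both
# sentences produce exactly the same feature dict.
print(toy_find_features("I am happy") == toy_find_features("I am not happy"))  # prints True
```

If this is right, the classifier cannot tell the two sentences apart, because the only signal it sees is that "happy" is present.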

Below is the part of my code where I define the features:

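import random

import nltk
from nltk.tokenize import word_tokenize

# open and read the txt files holding the labelled dataset
with open("short_reviews_positive.txt", "r") as f:
    short_pos = f.read()
with open("short_reviews_negative.txt", "r") as f:
    short_neg = f.read()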


all_words = []
documents = []


# tokenize, POS-tag, and keep only adjectives (tags starting with "J")
allowed_word_types = ["J"]

# one loop handles both files; each line of a file is one review
for text, label in ((short_pos, "pos"), (short_neg, "neg")):
    for p in text.split('\n'):
        documents.append((p, label))
        for word, tag in nltk.pos_tag(word_tokenize(p)):
            if tag[0] in allowed_word_types:
                all_words.append(word.lower())

# define features
all_words = nltk.FreqDist(all_words)
word_features = [w for (w, c) in all_words.most_common(5000)]


def find_features(document):
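    # lowercase the tokens so they match the lowercased entries in word_features;
    # a set makes each membership test O(1)
    words = {w.lower() for w in word_tokenize(document)}
    features = {}
    for w in word_features:
        features[w] = (w in words)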

    return features



featuresets = [(find_features(rev), category) for (rev, category) in documents]

random.shuffle(featuresets)


training_set = featuresets[:7500]
testing_set =  featuresets[7500:]
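
The direction I have in mind can be sketched as follows. This is only a rough sketch under assumptions, not a tested solution: `NEGATIONS`, `CLAUSE_END`, `mark_negation_simple` and `find_features_with_bigrams` are hypothetical names, the word lists are deliberately tiny, and `str.split()` again stands in for `word_tokenize`. The adjacent pairs from `zip(tokens, tokens[1:])` are the same pairs `nltk.bigrams(tokens)` would yield. NLTK also ships `nltk.sentiment.util.mark_negation`, which appends `_NEG` to tokens in a similar way, so the hand-written helper here may not be needed.

```python
# Hypothetical helper names throughout; str.split() stands in for word_tokenize.
NEGATIONS = {"not", "no", "never", "n't"}
CLAUSE_END = {".", ",", ";", "!", "?"}


def mark_negation_simple(tokens):
    """Append "_NEG" to each token after a negation word, up to clause-ending punctuation."""
    marked = []
    in_scope = False
    for tok in tokens:
        if tok in CLAUSE_END:
            in_scope = False          # punctuation closes the negation scope
            marked.append(tok)
        elif tok in NEGATIONS:
            in_scope = True           # the negation word itself is left unmarked
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if in_scope else tok)
    return marked


def find_features_with_bigrams(document, unigram_features, bigram_features):
    tokens = document.lower().split()
    unigrams = set(mark_negation_simple(tokens))
    bigrams = set(zip(tokens, tokens[1:]))   # same pairs as nltk.bigrams(tokens)
    features = {w: (w in unigrams) for w in unigram_features}
    for bg in bigram_features:
        features[" ".join(bg)] = (bg in bigrams)
    return features


# Toy vocabularies, assumed for illustration only
unigram_features = ["happy", "happy_NEG"]
bigram_features = [("not", "happy")]

pos = find_features_with_bigrams("i am happy", unigram_features, bigram_features)
neg = find_features_with_bigrams("i am not happy", unigram_features, bigram_features)
print(pos)
print(neg)
```

With these toy vocabularies the two sentences no longer map to the same features: the negated sentence sets `happy_NEG` and `not happy` instead of `happy`. In my real code the unigram and bigram vocabularies would have to be built from the training reviews, and the adjective-only filter would need to let the negated forms through, which is the part I am unsure how to wire in.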