NLTK WordNet Lemmatizer - How to remove unknown words?

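I'm trying to use the NLTK WordNet Lemmatizer on tweets.

I want to remove every word that isn't found in WordNet (Twitter handles and the like), but WordNetLemmatizer.lemmatize() gives no feedback. If it can't find a word, it simply returns the word unchanged.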

Is there a way to check if a word is found in WordNet or not?

Alternatively, is there a better way to strip everything except proper English words from a string?

1 Answer

You can check with wordnet.synsets(token), which returns an empty list when WordNet has no entry for the token. Be sure to handle punctuation as well. Here's an example:

from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import wordnet

my_list_of_strings = []  # populate list before using

wpt = WordPunctTokenizer()
only_recognized_words = []

for s in my_list_of_strings:
    tokens = wpt.tokenize(s)
    if tokens:  # check if empty string
        for t in tokens:
            if wordnet.synsets(t):
                only_recognized_words.append(t)  # only keep recognized words
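
The same check can be wrapped in a reusable function. This is only a sketch under a few assumptions: the names `wordnet_has` and `keep_known_words` are made up for illustration, and the `isalpha()` test is one possible way of handling punctuation (it also drops hyphenated words and contractions). The lookup is passed in as a parameter so the filter can be tried with a small vocabulary before the WordNet data is downloaded.

```python
from typing import Callable, Iterable, List


def wordnet_has(token: str) -> bool:
    """Return True if WordNet has at least one synset for the token."""
    # Imported lazily so the filter below also works with another lookup.
    # The corpus must be fetched once with nltk.download("wordnet").
    from nltk.corpus import wordnet
    return bool(wordnet.synsets(token))


def keep_known_words(tokens: Iterable[str],
                     is_known: Callable[[str], bool] = wordnet_has) -> List[str]:
    """Keep alphabetic tokens for which is_known(token) is true."""
    kept = []
    for token in tokens:
        # Skip punctuation and numbers before doing the lookup.
        if token.isalpha() and is_known(token):
            kept.append(token)
    return kept
```

With the default lookup, `keep_known_words(wpt.tokenize(s))` behaves much like the inner loop above. Keep in mind that WordNet only covers nouns, verbs, adjectives, and adverbs, so function words such as "the" are filtered out too.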

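But you should really write some custom logic for Twitter data, especially for handling hashtags, @replies, usernames, links, retweets, and so on. There are plenty of papers with strategies you can borrow.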
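
As a rough starting point for that custom logic, here is a minimal sketch that strips the most common Twitter-specific tokens before the text reaches the tokenizer. The regular expressions and the `clean_tweet` name are assumptions for illustration, not an established recipe. They will miss edge cases (for example, e-mail addresses get mangled by the mention pattern), so treat them as a baseline to adapt.

```python
import re

# Hypothetical patterns; tune them to the data you actually have.
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
RETWEET_RE = re.compile(r"^RT\s+@\w+:?\s*")     # leading "RT @user:" marker
MENTION_RE = re.compile(r"@\w+")                # @replies and usernames
HASHTAG_RE = re.compile(r"#(\w+)")              # hashtags, word is captured


def clean_tweet(text):
    """Remove links, retweet markers and mentions; keep hashtag words."""
    text = URL_RE.sub(" ", text)
    text = RETWEET_RE.sub("", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # "#weather" becomes "weather"
    return " ".join(text.split())       # collapse leftover whitespace
```

The cleaned string can then go through the tokenizer and the WordNet check as before. Keeping the hashtag word rather than dropping it is a design choice: hashtags often carry real vocabulary, and the WordNet check discards the ones that are not words anyway.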

Answered 2024-05-03T16:51:39+08:00
