UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)

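I am currently writing a program that uses the Python NLTK library to determine whether a review is positive or negative. When I try to tokenize each word and store the results in an array, I keep getting the error above. The code up to and including the failing line is: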

from nltk.tokenize import word_tokenize

...

short_pos = open("reviews/pos_reviews.txt", "r").read()
short_neg = open("reviews/neg_reviews.txt", "r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append( (r, "neg") )

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

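The second-to-last line is where the error is reported. If I comment that line out, the error shows up on the next line instead. I'm not sure where this error comes from, since I don't think I'm using Unicode at all. Any help would be greatly appreciated!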

1 Answer

In Python 2.7, try using the io module to specify the file encoding; see "Difference between io.open vs open in python".
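To see why the encoding matters, here is a minimal sketch that reproduces the error message without NLTK. The review text is a made-up example, not taken from the asker's files; it assumes the files are UTF-8 encoded, where 0xc3 is the first byte of a two-byte character such as "é". In Python 2, `open(...).read()` returns raw bytes, and the error suggests those bytes are later decoded implicitly with the default ASCII codec.

```python
# Hypothetical review line, stored as UTF-8 bytes.
# "é" is encoded as the two bytes 0xc3 0xa9, so 0xc3 sits at index 5.
raw = b"fianc\xc3\xa9e loved it"

try:
    raw.decode("ascii")          # ASCII only covers bytes 0-127
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc            # same kind of failure as in the question

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("utf-8")

print(ascii_error)
print(text)
```

Note that 0xc3 alone does not prove the file is UTF-8; if the decoded text looks wrong, the file may use another encoding such as Latin-1.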

Also, context managers (i.e. with ... as ...) are your friend, especially when it comes to I/O: https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/
```
import io

from nltk.tokenize import word_tokenize

documents = []

with io.open("reviews/pos_reviews.txt", "r", encoding="utf8") as fin:
    for line in fin:
        documents.append((line.strip(), "pos"))
```
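The snippet above only covers the positive reviews. A rough sketch of how the same idea could extend to both files follows. The helper name `load_labeled_reviews` is hypothetical, the demo writes a temporary file so it runs on its own, and it again assumes UTF-8 input. Unlike the loop above, it skips blank lines.

```python
import io
import os
import tempfile


def load_labeled_reviews(path, label, encoding="utf8"):
    """Return (review, label) pairs, one per non-empty line of a text file."""
    pairs = []
    # io.open decodes while reading, so each line is already Unicode text.
    with io.open(path, "r", encoding=encoding) as fin:
        for line in fin:
            line = line.strip()
            if line:  # skip blank lines
                pairs.append((line, label))
    return pairs


# Demo: a temporary file stands in for reviews/pos_reviews.txt.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "pos_reviews.txt")
with io.open(demo_path, "w", encoding="utf8") as fout:
    fout.write(u"A touching, na\u00efve little film\n\nGreat caf\u00e9 scenes\n")

documents = load_labeled_reviews(demo_path, "pos")
print(documents)
```

With the real files, the two calls would be `load_labeled_reviews("reviews/pos_reviews.txt", "pos")` and `load_labeled_reviews("reviews/neg_reviews.txt", "neg")`, and `word_tokenize` can then be applied to the decoded review strings.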
Answered 2024-05-19T07:49:56+08:00
