UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

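I am using NLTK to perform k-means clustering on my text file, in which each line is treated as a document. For example, my text file looks something like this: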

belong finger death punch
hasty
mike hasty walls jericho
jägermeister rules
rules bands follow performing jägermeister stage
approach

The demo code I am trying to run is this: https://gist.github.com/xim/1279283

The error I receive is this:

    Traceback (most recent call last):
      File "cluster_example.py", line 40, in <module>
        words = get_words(job_titles)
      File "cluster_example.py", line 20, in get_words
        words.add(normalize_word(word))
      File "", line 1, in
      File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
        result = func(*args)
      File "cluster_example.py", line 14, in normalize_word
        return stemmer_func(word.lower())
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
        word = (word.replace(u"\u2019", u"\x27")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

What is happening here?

7 Answers

2
For me, the problem was the terminal encoding. Adding UTF-8 to .bashrc solved it:
```
export LC_CTYPE=en_US.UTF-8
```
Don't forget to reload .bashrc afterwards:
```
source ~/.bashrc
```
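To see whether the locale setting actually reaches Python, a small check like the one below can be run before and after editing .bashrc. This is only a sketch using the standard library; the helper name `report_encodings` is made up for illustration.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python derives from the environment."""
    return {
        # Encoding implied by the locale settings (LC_ALL, LC_CTYPE, LANG).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when writing to the terminal; None if stdout
        # has no encoding attribute (for example when it is replaced).
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print("{}: {}".format(name, value))
```

If both values still report an ASCII-only encoding after reloading .bashrc, the export probably did not reach the shell that starts Python.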
Answered on 2024-05-16T16:03:50+08:00
0
This works fine for me.
```
f = open(file_path, 'r+', encoding="utf-8")
```
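You can add a third parameter, encoding, to make sure the file is read as 'utf-8'.

Note: this method works fine in Python 3; I did not try it in Python 2.7, where the built-in open() does not accept an encoding argument (io.open() does).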
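A variant of the same idea that should run on both Python 2.7 and Python 3 is to go through `io.open`, which accepts `encoding` on both versions. The sketch below assumes the file is UTF-8; `read_lines` is a hypothetical helper, not part of the demo script.

```python
import io


def read_lines(path, encoding="utf-8"):
    """Read a text file and return its lines as decoded, stripped text."""
    # io.open decodes while reading, so every line is already text
    # (unicode on Python 2, str on Python 3) and needs no later decode.
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.strip() for line in handle]
```

Because the decoding happens at the file boundary, the rest of the script only ever sees text and no implicit ASCII conversion is triggered.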
Answered on 2024-05-16T16:03:50+08:00
1
You can try this before using the job_titles string:
```
source = unicode(job_titles, 'utf-8')
```
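In the demo script job_titles is built as a list of lines, so the conversion would have to be applied to each element rather than to the list itself. A sketch in Python 3 syntax, where undecoded data has type `bytes`; `to_text` and the sample list are made up for illustration:

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; pass through values that are already text."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Sample input standing in for lines read from the file as raw bytes.
raw_titles = [b"j\xc3\xa4germeister rules", b"approach"]
titles = [to_text(title) for title in raw_titles]
```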
Answered on 2024-05-16T16:03:50+08:00
105
For Python 3, the default encoding is "utf-8". In case of any problems, the following steps are suggested in the base documentation: https://docs.python.org/2/library/csv.html#csv-examples
- Create a function
```
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
```
- Then use the function inside the reader, e.g.
```
csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))
```
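The `utf_8_encoder` recipe above targets the Python 2 csv module, which works on byte strings. On Python 3 the csv module works on text, so a rough equivalent is to open the file with an explicit encoding. This is a sketch assuming a UTF-8 file; `read_rows` is an illustrative name.

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Return all rows of a CSV file as lists of text fields (Python 3)."""
    # newline="" lets the csv module handle line endings itself,
    # as the csv documentation recommends.
    with open(path, "r", encoding=encoding, newline="") as handle:
        return list(csv.reader(handle))
```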
Answered on 2024-05-16T16:03:50+08:00
14
You can also try this:
```
import sys
reload(sys)
sys.setdefaultencoding('utf8')
```
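Note that `reload(sys)` and `sys.setdefaultencoding` exist only on Python 2, and the call changes the default for the whole process. Before reaching for it, it may help to inspect the current default; a minimal sketch with a made-up helper name:

```python
import sys


def default_encoding_is_utf8():
    """Report whether the interpreter's default string encoding is UTF-8."""
    # A stock Python 2 reports 'ascii'; Python 3 reports 'utf-8'.
    name = sys.getdefaultencoding().lower().replace("-", "")
    return name == "utf8"
```

On Python 3 this already returns True, so the snippet above is not needed there.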
Answered on 2024-05-16T16:03:50+08:00
22
The file is being read as a bunch of str objects, but they should be unicode objects. Python tries to convert them implicitly, but fails. Change:
```
job_titles = [line.strip() for line in title_file.readlines()]
```
to explicitly decode each str to unicode (assuming UTF-8 here):
```
job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]
```
It could also be solved by importing the codecs module and using codecs.open rather than the built-in open.
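The failure described in this answer can be reproduced in isolation. The sketch below uses Python 3 syntax, where the conversion Python 2 attempts implicitly has to be written out as an explicit `.decode("ascii")`. The byte string is a made-up sample containing a UTF-8 encoded right single quotation mark (U+2019), the character that the stemmer line in the traceback tries to replace.

```python
# UTF-8 bytes; the first byte of the encoded U+2019 (0xe2) sits at index 13.
raw = b"rules of mike\xe2\x80\x99s band"

try:
    raw.decode("ascii")  # stands in for Python 2's implicit conversion
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding explicitly with the codec the file was written in succeeds.
text = raw.decode("utf-8")

print(message)
# 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)
```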
Answered on 2024-05-16T16:03:50+08:00
25
To find any and all Unicode-related errors, use the following command:
```
grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx
```
Found mine in
```
/etc/letsencrypt/options-ssl-nginx.conf:        # The following CSP directives don't use default-src as
```
Using shed, I found the offending sequence. It turned out to be an editor mistake.

    00008099:   C2 194 302 11000010
    00008100:   A0 160 240 10100000
    00008101: d 64 100 144 01100100
    00008102: e 65 101 145 01100101
    00008103: f 66 102 146 01100110
    00008104: a 61 097 141 01100001
    00008105: u 75 117 165 01110101
    00008106: l 6C 108 154 01101100
    00008107: t 74 116 164 01110100
    00008108: - 2D 045 055 00101101
    00008109: s 73 115 163 01110011
    00008110: r 72 114 162 01110010
    00008111: c 63 099 143 01100011
    00008112:   C2 194 302 11000010
    00008113:   A0 160 240 10100000
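The same search can be sketched in Python when grep -P or shed is not available. The function name and the sample bytes are made up for illustration. The sample mirrors the dump above, where `C2 A0` is the UTF-8 encoding of a no-break space (U+00A0) on either side of `default-src`.

```python
def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte above 0x7f in data."""
    return [
        (offset, byte)
        for offset, byte in enumerate(bytearray(data))
        if byte > 0x7F
    ]


# Hypothetical sample: '#', a no-break space, 'default-src', a no-break space.
sample = b"#\xc2\xa0default-src\xc2\xa0"
hits = find_non_ascii(sample)

for offset, byte in hits:
    print("{:08d}: {:02X}".format(offset, byte))
```

For a real file, the bytes can be obtained by opening it in binary mode ("rb") and passing the result of read() to the function.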
Answered on 2024-05-16T16:03:50+08:00
