UnicodeDecodeError: 'ascii' codec can't decode byte

This is related to the following question -

I have a Python application that performs the following tasks -

# -*- coding: utf-8 -*-

1. Read a Unicode text file (non-English) -

import codecs

def readfile(file, access, encoding):
    # codecs.open decodes the file contents while reading
    with codecs.open(file, access, encoding) as f:
        return f.read()

text = readfile('teststory.txt','r','utf-8-sig')

This returns the contents of the given text file as a string.
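
A minimal sketch of what the `utf-8-sig` codec changes when reading, assuming the file starts with a UTF-8 byte order mark. The helper name `read_text`, the sample text, and the temporary file are made up for illustration, and `io.open` is swapped in for `codecs.open` here because it behaves the same way for this purpose on both Python 2 and Python 3.

```python
import codecs
import io
import os
import tempfile

def read_text(path, encoding="utf-8-sig"):
    """Return the whole file as text, decoded with the given encoding."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()

# Hypothetical sample: non-ASCII text saved as UTF-8 with a leading BOM.
sample = u"caf\u00e9 \u20ac"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + sample.encode("utf-8"))

with_sig = read_text(path, "utf-8-sig")  # BOM is dropped
plain = read_text(path, "utf-8")         # BOM survives as U+FEFF
os.remove(path)
```

With plain `utf-8`, the BOM stays in the string as a leading `u"\ufeff"` character, which can later end up glued to the first word of the text.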

2. Split text into sentences.

3. Go through words in each sentence and identify verbs, nouns etc.

Reference - Searching for Unicode characters in Python and Find word infront and behind of a Python list

4. Add them into separate variables as below

nouns = "CAR" | "BUS" |

verbs = "DRIVES" | "HITS"

5. Now I'm trying to pass them into an NLTK context-free grammar as below -

grammar = nltk.parse_cfg('''
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> '''+nouns+'''
    V -> '''+verbs+'''
    ''')

It gives me the following error -

line 40, in V -> '''+verbs+''' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)
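
For context, this message is what the `ascii` codec reports when it meets a byte above 0x7f. In Python 2, combining a byte string with a unicode string triggers such a decode implicitly, which is one likely cause here. The sketch below reproduces the message explicitly; the character U+0DB8 is an arbitrary choice for illustration, picked only because its UTF-8 encoding starts with byte 0xe0.

```python
# UTF-8 bytes of one non-ASCII character (arbitrary example, not from the
# original file): U+0DB8 encodes to the three bytes e0 b6 b8.
raw = u"\u0db8".encode("utf-8")

try:
    raw.decode("ascii")  # what an implicit bytes-to-unicode coercion does
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```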

How can I overcome this and pass the variables to the NLTK CFG?

Full code - https://dl.dropboxusercontent.com/u/4959382/new.zip

1 Answer

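In general, you have these strategies:
- treat the input as a sequence of bytes; then both the input and the grammar are utf-8 encoded data (bytes)
- treat the input as a sequence of unicode code points; then both the input and the grammar are unicode.
- re-encode the unicode code points as ascii, i.e. use escape sequences.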
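
The three representations can be sketched side by side as below. This is only an illustration of the encodings involved, using a euro sign as a stand-in terminal; the variable names are made up, and nothing here calls NLTK.

```python
terminal = u"\N{EURO SIGN}"

# Strategy 1: work with UTF-8 encoded bytes everywhere.
as_bytes = terminal.encode("utf-8")

# Strategy 2: work with unicode text everywhere (decode bytes on input).
as_text = as_bytes.decode("utf-8")

# Strategy 3: turn non-ASCII code points into ASCII escape sequences.
as_escaped = terminal.encode("unicode_escape")

# The escaped form is pure ASCII and can be converted back losslessly.
round_trip = as_escaped.decode("unicode_escape")
```

Whichever form is chosen, the point is to use it consistently for both the input text and the grammar, so that bytes and unicode are never mixed implicitly.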
The pip-installed nltk, 2.0.4 in my case, does not accept unicode directly but does accept quoted unicode constants; that is, all of the following appear to work:
```
In [26]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar')
Out[26]: <Grammar with 2 productions>

In [27]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("utf-8"))
Out[27]: <Grammar with 2 productions>

In [28]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("unicode_escape"))
Out[28]: <Grammar with 2 productions>
```
Note that I quoted the unicode text but not the plain text: "€" vs bar.
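
Applied to the question, a rough sketch of building the grammar text could look like this. It assumes the words are already unicode strings and contain no double quotes; the helper name `alternatives` is hypothetical, and the English words stand in for the real non-English ones.

```python
def alternatives(words):
    """Join words into a CFG right-hand side, quoting each as a terminal."""
    return u" | ".join(u'"%s"' % word for word in words)

# Placeholder words; the real lists would come from the tagged sentences.
nouns = alternatives([u"CAR", u"BUS"])
verbs = alternatives([u"DRIVES", u"HITS"])

# Every piece is unicode, so no implicit ascii decoding takes place.
grammar_text = u"""
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> %s
    V -> %s
""" % (nouns, verbs)

print(grammar_text)
```

The resulting `grammar_text` could then be handed to the parser as is, or encoded first with `utf-8` or `unicode_escape` as in the session above, depending on what the installed nltk version accepts.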
Answered 2024-05-03T05:24:45+08:00
