UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

I am learning urllib2 and Beautiful Soup, and on my first tests I ran into errors like this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

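There seem to be many posts about this type of error, and I have tried the solutions I can understand, but each seems to come with a catch-22. For example:

I want to print post.text (where text is a Beautiful Soup method that returns only the text). Both str(post.text) and post.text produce unicode errors (on characters such as the right apostrophe ' and the ellipsis ...).

So I added post = unicode(post) above str(post.text), and then I get: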

AttributeError: 'unicode' object has no attribute 'text'

I have also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

Then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

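It would be great if I could declare somewhere at the top of the document "make this content interpretable" and still have access to the text function I need.

Update: after suggestions:

If I add post = post.decode("utf-8") above str(post.text), I get: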

TypeError: unsupported operand type(s) for -: 'str' and 'int'

If I add post = post.decode() above str(post.text), I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text), I get:

AttributeError: 'str' object has no attribute 'text'

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

In the interest of trying anything that might work, I installed lxml for Windows from here and used it as follows:

parsed_content = BeautifulSoup(original_content, "lxml")

as described at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters .

These steps do not seem to have made any difference.

I am using Python 2.7.4 and Beautiful Soup 4.

Solution:

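After getting a deeper understanding of unicode, utf-8 and the Beautiful Soup types, it turned out to be related to how I was printing. I removed all my str calls and concatenations, e.g. str(something) + post.text + str(something_else), so that the print statement became something, post.text, something_else. It now seems to print fine, except that I have less control over the formatting at this stage (e.g. a space is inserted at each ,).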
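One way to regain control over the formatting, offered as a sketch rather than the asker's actual code: build the whole line as unicode with a format string, then encode it once, right before output. Mixing byte strings and unicode in one concatenation makes Python 2 convert implicitly with the ASCII codec, which is where errors like the ones above tend to come from. The function name format_post, the sample values, and the layout string are all hypothetical.

```python
# -*- coding: utf-8 -*-
# Sketch only: format_post and the sample values are hypothetical.
# The u"" and b"" prefixes keep this valid on both Python 2.7 and Python 3.

def format_post(index, text, source):
    """Build one output line as unicode, with full control over spacing."""
    return u"{0}: {1} ({2})".format(index, text, source)

# Sample text containing a right apostrophe (U+2019) and an ellipsis (U+2026).
line = format_post(1, u"It\u2019s here\u2026", u"example")

# Encode exactly once, at the point of output.
encoded = line.encode("utf-8")
print(repr(encoded))
```

On Python 2 the final step would be print encoded; the point is that every piece stays unicode until that single encode call.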

3 Answers

42
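In Python 2, a unicode object is encoded implicitly when it is printed, and when the output encoding is ASCII (as the 'ascii' codec in the error message indicates) it can only be printed if it can be converted to ASCII. If it can't be encoded in ASCII, you'll get that error. You probably want to encode it explicitly and then print the resulting str: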
```
print post.text.encode('utf-8')
```
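To see the difference between the two codecs in isolation, here is a minimal sketch. The sample string and the helper name encode_or_reason are made up for illustration; the code is written to run on Python 3 as well as Python 2.7.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; u"\u2026" is the ellipsis named in the traceback.
sample = u"Read more\u2026"

def encode_or_reason(text, codec):
    """Return the encoded bytes, or the codec's failure reason as text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return exc.reason

ascii_attempt = encode_or_reason(sample, "ascii")  # fails on the ellipsis
utf8_attempt = encode_or_reason(sample, "utf-8")   # succeeds

print(repr(ascii_attempt))
print(repr(utf8_attempt))
```

ASCII has no representation for the ellipsis, so that attempt reports the same "ordinal not in range(128)" reason as the traceback, while UTF-8 encodes it as three bytes.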
Answered 2024-05-05T17:32:06+08:00

import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen(THE_URL).read()
soup = BeautifulSoup(html)
print("'" + str(soup.encode("ascii")) + "'")

Works for me ;-)
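Encoding to ASCII necessarily does something with characters that ASCII cannot represent. As a rough illustration using only the built-in string encode method (Beautiful Soup's own encode may behave differently), the standard errors argument selects the fallback. The sample string is hypothetical.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text ending in an ellipsis (U+2026).
sample = u"Read more\u2026"

# Built-in error handlers for characters the target codec cannot represent.
replaced = sample.encode("ascii", "replace")             # substitutes "?"
ignored = sample.encode("ascii", "ignore")               # drops the character
char_refs = sample.encode("ascii", "xmlcharrefreplace")  # numeric character reference

for result in (replaced, ignored, char_refs):
    print(repr(result))
```

All three avoid the exception, but only the character-reference form keeps enough information to recover the original character later.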

Answered 2024-05-05T17:32:06+08:00

2

Have you tried .decode() or .decode("utf-8")?

Also, I suggest using lxml with the html5lib parser:
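For context on the .decode() suggestion: decoding goes from bytes to unicode, the opposite direction of encode. The byte 0xe2 in the UnicodeDecodeError quoted in the question is the first byte of the UTF-8 encoding of characters such as u'\u2026'. A small sketch, with hypothetical variable names:

```python
# -*- coding: utf-8 -*-
# The three bytes that UTF-8 uses for the ellipsis character U+2026.
raw = b"\xe2\x80\xa6"

# Decoding with the right codec recovers the single unicode character.
decoded = raw.decode("utf-8")

# Decoding the same bytes as ASCII fails on the very first byte (0xe2).
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = (exc.start, exc.reason)

print(repr(decoded))
print(ascii_error)
```

This also shows why calling .decode() on something that is already unicode, or with no codec named, is unlikely to help: in Python 2 the default codec for such conversions is ASCII.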

http://lxml.de/html5parser.html

Answered 2024-05-05T17:32:06+08:00
