How do I correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

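I'm running a Python program that fetches a UTF-8 encoded web page, and I use BeautifulSoup to extract some text from the HTML.

However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

Example program: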

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'

But I would expect a Python Unicode string to render the ö in the word können as \xf6:

# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

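I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and calling read() and decode() on the response object, but it either makes no difference or throws an error.

With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded for the ö character (i.e. it contains 0xc3 0xb6):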

      20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
      6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
      73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|
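
One way to see where the unexpected output comes from is to decode the two bytes from the dump by hand. The sketch below is not part of the original program and is written to run on Python 3. The idea that the bytes were treated as ISO-8859-2 is an assumption, made only because that codec maps 0xc3 and 0xb6 to the same \u0102 and \u015b that appear in the output.

```python
# The bytes that hexdump shows for the word "können".
raw = b'k\xc3\xb6nnen'

# Decoded as UTF-8, the pair c3 b6 becomes the single code point U+00F6.
as_utf8 = raw.decode('utf-8')

# Decoded as ISO-8859-2 (an assumption about what the parser guessed),
# each byte becomes its own character: U+0102 and U+015B.
as_latin2 = raw.decode('iso-8859-2')

print(repr(as_utf8))
print(repr(as_latin2))
```

If the second result matches the output of the program, the bytes on the wire are fine and the problem lies in the encoding that was guessed during parsing.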

I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?

2 Answers

Encoding the result to utf-8 seems to work for me:

print (soup.find('div', id='navbutton_account')['title']).encode('utf-8')

It produces:

Hier können Sie sich kostenlos registrieren und / oder einloggen!

Answered on 2024-04-20T08:45:16+08:00

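As justhalf pointed out, my question is essentially a duplicate of this question.

The HTML content reports itself as UTF-8 encoded and, for the most part, it is, except for one or two rogue invalid UTF-8 characters.

This apparently confuses BeautifulSoup about which encoding is in use. When I tried to decode the content as UTF-8 first, before passing it to BeautifulSoup, like this: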
```
soup = BeautifulSoup(response.read().decode('utf-8'))
```
I would get the error:
```
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: 
                    invalid continuation byte
```
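Looking more closely at the output, there was one instance of the character Ü that was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.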
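
The difference between the two byte sequences can be reproduced in isolation. This is a minimal sketch for Python 3, not taken from the page itself: the word "bersicht" after the broken character is a hypothetical stand-in for whatever text actually followed it.

```python
# Hypothetical fragment: a correctly encoded "Ü" and the broken variant.
good = b'\xc3\x9cbersicht'
bad = b'\xe3\x9cbersicht'

# c3 9c is a complete two-byte sequence and decodes to U+00DC.
print(repr(good.decode('utf-8')))

failure = None
try:
    bad.decode('utf-8')
except UnicodeDecodeError as err:
    # 0xe3 announces a three-byte sequence, but the third byte is the
    # ASCII letter "b", which is not a continuation byte.
    failure = err
    print(err.reason, err.start, err.end)
```

The lead byte 0xe3 promises two continuation bytes, so the decoder fails on the byte that follows 0x9c, which is why the error message names a two-byte range.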

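As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup: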
```
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
```
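
To see what 'ignore' actually discards, the decoding step can be tried on a small byte string without BeautifulSoup. The fragment below is a hypothetical example for Python 3, not content from the real page, and the comparison with 'replace' is an addition for illustration only.

```python
# Hypothetical page fragment: valid UTF-8 apart from one broken "Ü".
html = b'<p title="k\xc3\xb6nnen">\xe3\x9cbersicht</p>'

# 'ignore' silently drops the two undecodable bytes.
ignored = html.decode('utf-8', 'ignore')

# 'replace' puts U+FFFD in their place, so the damage stays visible.
replaced = html.decode('utf-8', 'replace')

print(repr(ignored))
print(repr(replaced))
```

Either result is a Unicode string that can be handed to BeautifulSoup in place of the raw response. With 'ignore' the broken character simply disappears from the text, which may or may not be acceptable depending on what is being extracted.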
Answered on 2024-04-20T08:45:16+08:00
