Python BeautifulSoup with the "lxml" parser breaks long strings into characters

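I noticed a strange inconsistency in how Python [3.6.5] BeautifulSoup [4.6.0] with the "lxml" [4.2.1] parser handles long bytes objects versus long strings. (Apparently, "long" means more than 16,384 = 2 ** 14 characters or bytes.)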

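For example, I download the text of Othello from the MIT website and feed it to BS both in raw (bytes) form and after decoding it to a string. Both objects have the same length because the document contains no multi-byte characters.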

from bs4 import BeautifulSoup
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request

url = "http://shakespeare.mit.edu/othello/full.html"
html_raw = urllib.request.urlopen(url).read()
html_str = html_raw.decode("iso-8859-1")  # decode the bytes already downloaded

type(html_raw), len(html_raw)
#(<class 'bytes'>, 304769)
type(html_str), len(html_str)
#(<class 'str'>, 304769)
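
The equal lengths above are what you would expect from a single-byte encoding. As a small illustration (the sample strings here are made up for the example, not taken from the downloaded file), ISO-8859-1 maps each byte to exactly one character, whereas UTF-8 needs more than one byte for a non-ASCII character:

```python
# ISO-8859-1 maps every byte to exactly one character, so decoding
# never changes the length.
ascii_bytes = b"<i>Enter OTHELLO, IAGO, and Attendants with torches</i>"
ascii_str = ascii_bytes.decode("iso-8859-1")
same_length = len(ascii_bytes) == len(ascii_str)

# By contrast, UTF-8 encodes the non-ASCII character U+00E9 as two bytes,
# so the byte length exceeds the character count.
accented_str = "caf\u00e9"
accented_utf8 = accented_str.encode("utf-8")
length_pair = (len(accented_str), len(accented_utf8))

print(same_length)  # True
print(length_pair)  # (4, 5)
```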

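The resulting soups are identical for shorter strings/bytes but differ for longer ones. Namely, the soup made from the string suddenly starts treating words as individual characters, while the soup made from bytes handles the whole file correctly: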

BeautifulSoup(html_raw[:16410], "lxml")
#... <i>Enter OTHELLO, IAGO, and Attendants with torches</i>
#</blockquote>
#<a></a></body></html>
BeautifulSoup(html_str[:16410], "lxml")
#... <i>Enter OTHELLO, IAGO, and Attendants with torch   e   s   /   i   &gt;   
#   /   b   l   o   c   k   q   u   o   t   e   &gt;      
#
#   A   </i></blockquote></body></html>

This holds both for a subset of the document (above) and for the whole document:

BeautifulSoup(html_raw, "lxml")
#...
#<p><i>Exeunt</i></p>
#</blockquote></body>
#</html>

BeautifulSoup(html_str, "lxml")
#...
#   p   &gt;   i   &gt;   E   x   e   u   n   t   /   i   &gt;   /   p   &gt;   
#   /   h   t   m   l   &gt;   
#   
#   
#   </i></blockquote></body></html>

When I use "html.parser", there is no difference between the outputs.

Is this a bug in the BS implementation? Or am I violating some undocumented (or documented?) assumption?

1 Answer

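It is not caused by the file size. The problem probably only occurs on Linux, since it works correctly on Windows. It happens because the HTML has the windows-1252 charset; adding .encode() solves the problem: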
```
soup_raw = BeautifulSoup(html_raw, "lxml").encode("iso-8859-1")

soup_str = BeautifulSoup(html_str.encode("iso-8859-1"), "lxml")
```
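
The same idea can be wrapped in a small helper so that the parser always receives bytes. This is only a sketch: `as_bytes` is a hypothetical name introduced here, and the default encoding of "iso-8859-1" is an assumption carried over from the question, not something the parser requires.

```python
def as_bytes(markup, encoding="iso-8859-1"):
    """Return markup as bytes (hypothetical helper, not part of bs4).

    str input is encoded with the given encoding; bytes-like input is
    returned as an immutable bytes object without re-encoding.
    """
    if isinstance(markup, str):
        return markup.encode(encoding)
    return bytes(markup)


# Made-up sample markup for illustration.
html_str = "<p><i>Exeunt</i></p>"
html_raw = html_str.encode("iso-8859-1")

# Both inputs end up as the same bytes object content.
normalized_from_str = as_bytes(html_str)
normalized_from_raw = as_bytes(html_raw)
```

With bs4 and lxml installed, the result could then be passed on as `BeautifulSoup(as_bytes(html_str), "lxml", from_encoding="iso-8859-1")`, where `from_encoding` tells BeautifulSoup which encoding the bytes are in.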
Answered on 2024-05-03T20:51:41+08:00
