Does BeautifulSoup suppress lxml parsing errors?

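I am using lxml together with BeautifulSoup to parse and navigate XML files.

I have noticed some strange behavior. When reading a malformed XML file (for example, a truncated document or one with a missing closing tag), BeautifulSoup suppresses the exception thrown by the lxml parser.

Example: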

from bs4 import BeautifulSoup
soup = BeautifulSoup("<foo><bar>trololo<", "xml") # this will work

It is even possible to call find() and navigate such a broken XML tree.

Now let's try reading exactly the same malformed document with plain lxml:

from lxml import etree
root = etree.fromstring("<foo><bar>trololo<") # will throw XMLSyntaxError

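Why is that? I know BeautifulSoup does no parsing itself; it is just a wrapper library around lxml (or another parser). But I want an actual error to be raised when the XML is malformed, for example when a closing tag is missing. I only want basic XML syntax validation (I am not interested in XSD schema validation).

1 Answer

If you want to reproduce that behavior with plain lxml, pass a parser created with recover=True: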

from lxml import etree

root = etree.fromstring("<foo><bar>trololo<", parser=etree.XMLParser(recover=True))  # does not raise; the parser recovers

print(etree.tostring(root).decode())  # tostring() returns bytes on Python 3

Output:

<foo><bar>trololo</bar></foo>

If you look at the bs4 source code, in the builder directory you will find _lxml.py, which contains:

def default_parser(self, encoding):
    # This can either return a parser object or a class, which
    # will be instantiated with default arguments.
    if self._default_parser is not None:
        return self._default_parser
    return etree.XMLParser(
        target=self, strip_cdata=False, recover=True, encoding=encoding)

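lxml's HTMLParser sets recover by default so that it can cope with broken HTML; with XML you have to ask explicitly for the parser to try to recover. That is exactly what bs4 does above, which is why no exception reaches you.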
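
For the basic syntax check the question asks about, one option is to parse the document strictly with lxml first and only hand it to BeautifulSoup if that succeeds. The following is a minimal sketch, not part of bs4 or lxml: the helper name is_well_formed and the variable strict_parser are made up for illustration. It relies only on etree.XMLParser, etree.fromstring and etree.XMLSyntaxError.

```python
from lxml import etree


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses strictly, else (False, message).

    Hypothetical helper for illustration; it is not provided by lxml or bs4.
    """
    # recover=False is already the XMLParser default; it is spelled out here
    # to contrast with the recover=True parser that bs4 builds internally.
    strict_parser = etree.XMLParser(recover=False)
    try:
        etree.fromstring(xml_text, parser=strict_parser)
    except etree.XMLSyntaxError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    for doc in ("<foo><bar>trololo<", "<foo><bar>trololo</bar></foo>"):
        ok, error = is_well_formed(doc)
        print(doc, "->", "well-formed" if ok else "malformed: %s" % error)
```

This parses the input twice if you then pass it to BeautifulSoup, which is usually acceptable for small files. It checks well-formedness only, not validity against any schema.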

Answered 2024-04-27T08:03:37+08:00
