如何让Python bs4在XML上正常工作？-Java 学习之路

我正在尝试使用Python和BeautifulSoup 4（bs4）将Inkscape SVG转换为类似XML的格式，用于某些专有软件 . 我似乎无法让bs4正确解析一个最小的例子 . 我需要解析器尊重自闭标签，处理unicode，而不是添加html东西 . 我认为用selfClosingTags指定'lxml'解析器应该这样做，但是没有！看看这个 .

#!/usr/bin/python
from __future__ import print_function
from bs4 import BeautifulSoup

print('\nbs4 mangled XML:')
print(BeautifulSoup('<x><c name="b1"><d value="a"/></c></x>',
    features = "lxml", 
    selfClosingTags = ('d')).prettify())

print('''\nExpected output:
<x>
 <c name="b1">
  <d value="a"/>
 </c>
</x>''')

这打印

bs4 mangled XML:
/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.4.1-py2.7.egg/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
<html>
 <body>
  <x>
   <c name="b1">
    <d value="a">
    </d>
   </c>
  </x>
 </body>
</html>

Expected output:
<x>
 <c name="b1">
  <d value="a"/>
 </c>
</x>

我已经回顾了相关的StackOverflow问题，但我找不到解决方案 .

This question解决了html样板文件，但仅用于解析html的子部分，而不是用于解析xml .

This question适用于获得beautifulsoup 4以尊重自闭标签，并且没有接受的答案 .

This question似乎表明传递selfClosingTags参数应该会有所帮助，但正如您所看到的，现在会生成警告 BS4 does not respect the selfClosingTags argument ，并且自动关闭标记会被破坏 .

This question建议使用"xml"（而不是"lxml"）将导致空标签自动关闭 . 这可能适用于我的目的，但将"xml"解析器应用于我的实际数据会失败，因为这些文件包含unicode，"xml"解析器不支持 .

"xml"与"lxml"不同，是否"xml"不能支持unicode，"lxml"不能包含自闭标签？也许我只是想做一些被禁止的事情？

1 回答

1
如果您希望将结果输出为 xml ，则将其解析为 . 您的 xml 数据可以包含unicode，但是，您需要声明编码：
```
#!/usr/bin/env python
# -*- encoding: utf8 -*-
```
SelfClosingTags不再被识别 . 相反，Beautiful Soup将任何空标记视为空元素标记 . 如果将子项添加到空元素标记，则它将停止为空元素标记 .

将您的功能更改为这样应该可以工作（除了编码）：
```
print(BeautifulSoup('<x><c name="b1"><d value="a®"/></c></x>',
    features = "xml").prettify())
```
Result:
```
<?xml version="1.0" encoding="utf-8"?>
<x>
 <c name="b1">
  <d value="aÂŽ"/>
 </c>
</x>
```
回复于 2024-04-27T05:35:47+08:00

如何让Python bs4在XML上正常工作？

1 回答

相关问题