How do I parse XML in Python?

803

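I have many rows in a database that contain XML, and I'm trying to write a Python script that goes through those rows and counts the instances of a particular node attribute. For example, my tree looks like: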

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

How can I access the attribute values 1 and 2 in the XML using Python?

14 Answers

389

You can use BeautifulSoup:

from bs4 import BeautifulSoup

x = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

# Name a parser explicitly; "html.parser" ships with Python
y = BeautifulSoup(x, "html.parser")
>>> y.foo.bar.type["foobar"]
'1'

>>> y.foo.bar.find_all("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.find_all("type")[0]["foobar"]
'1'
>>> y.foo.bar.find_all("type")[1]["foobar"]
'2'

Answered 2024-05-04T15:26:56+08:00

36
Just to add another possibility, you can use untangle, a simple XML-to-Python-object library. Here is an example:

Installation
```
pip install untangle
```
Usage

Your XML file (slightly changed):
```
<foo>
   <bar name="bar_name">
      <type foobar="1"/>
   </bar>
</foo>
```
Accessing the attributes with untangle:
```
import untangle

obj = untangle.parse('/path_to_xml_file/file.xml')

print(obj.foo.bar['name'])
print(obj.foo.bar.type['foobar'])
```
The output will be:
```
bar_name
1
```
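More information about untangle is available in its documentation.
Also, if you are curious, there are published lists of tools for working with XML in Python; the most common ones are already mentioned in the earlier answers.
Answered 2024-05-04T15:26:56+08:00

minidom is the quickest and most straightforward: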

XML:

<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

Python:

from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

OUTPUT

4
item1
item1
item2
item3
item4

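Answered 2024-05-04T15:26:56+08:00

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without writing dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used for both serialization and parsing, as well as for a basic level of validation.

Parsing into Python data structures is straightforward: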

import declxml as xml

xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.dictionary('bar', [
        xml.array(xml.integer('type', attribute='foobar'))
    ])
])

xml.parse_from_string(processor, xml_string)

This produces the output:

{'bar': {'foobar': [1, 2]}}

You can also use the same processor to serialize data to XML:

data = {'bar': {
    'foobar': [7, 3, 21, 16, 11]
}}

xml.serialize_to_string(processor, data, indent='    ')

This produces the following output:

<?xml version="1.0" ?>
<foo>
    <bar>
        <type foobar="7"/>
        <type foobar="3"/>
        <type foobar="21"/>
        <type foobar="16"/>
        <type foobar="11"/>
    </bar>
</foo>

If you want to work with objects instead of dictionaries, you can define processors that transform data to and from objects as well.

import declxml as xml

class Bar:

    def __init__(self):
        self.foobars = []

    def __repr__(self):
        return 'Bar(foobars={})'.format(self.foobars)


xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.user_object('bar', Bar, [
        xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
    ])
])

xml.parse_from_string(processor, xml_string)

This produces the following output:

{'bar': Bar(foobars=[1, 2])}

Answered 2024-05-04T15:26:56+08:00

213

XML

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

PYTHON_CODE

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

tree = ET.parse("foo.xml")
root = tree.getroot()
print(root.tag)

# Print every attribute value of each <type> element under <bar>
for elem in root.findall("./bar/type"):
    for value in elem.attrib.values():
        print(value)

OUTPUT:

foo
1
2

Answered 2024-05-04T15:26:56+08:00

Here is some very simple but effective code. It was written for cElementTree; on Python 3, xml.etree.ElementTree uses the accelerated C implementation automatically, so that is what is imported here.

import sys
import xml.etree.ElementTree as ET


def exit_err(message):
    # Print the message to stderr and exit with a non-zero status
    sys.exit(message)


def find_in_tree(tree, node):
    # Return the first matching child, or an empty list if there is none
    found = tree.find(node)
    if found is None:
        print("No %s in file" % node)
        found = []
    return found


# Parse an XML file (specify the path)
def_file = "xml_file_name.xml"
try:
    dom = ET.parse(def_file)
    root = dom.getroot()
except (OSError, ET.ParseError):
    exit_err("Unable to open and parse input definition file: " + def_file)

# Find node 'myNode'; iterating over the result yields its child nodes
fwdefs = find_in_tree(root, "myNode")

Source:

http://www.snip2code.com/Snippet/991/python-xml-parse?fromPage=1

Answered 2024-05-04T15:26:56+08:00

Python has an interface to the expat XML parser.

xml.parsers.expat

It's a non-validating parser, so XML that violates its DTD or schema will not be caught (malformed XML still raises an error). But if you know your file is correct, then this is pretty good: you'll probably get the exact info you want, and you can discard the rest on the fly.

from xml.parsers import expat

stringofxml = """<foo>
    <bar>
        <type arg="value" />
        <type arg="value" />
        <type arg="value" />
    </bar>
    <bar>
        <type arg="value" />
    </bar>
</foo>"""
count = 0

def start(name, attr):
    # Called for every opening tag; count only <type> elements
    global count
    if name == 'type':
        count += 1

p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml, True)  # True marks this as the final chunk

print(count)  # prints 4
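
Applied to the XML from the question, the same handler approach can tally attribute values instead of elements. This is only a sketch: the names `xml_text`, `counts`, and `on_start` are made up for illustration, and it assumes the attribute of interest is `foobar` on `type` elements.

```python
from collections import Counter
from xml.parsers import expat

# Sample document taken from the question
xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

counts = Counter()

def on_start(name, attrs):
    # attrs is a dict mapping attribute names to their string values
    if name == "type" and "foobar" in attrs:
        counts[attrs["foobar"]] += 1

parser = expat.ParserCreate()
parser.StartElementHandler = on_start
parser.Parse(xml_text, True)  # True marks this as the final chunk

print(dict(counts))
```

Nothing is kept in memory here except the counter itself, which is the "discard the rest on the fly" property described above.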

Answered 2024-05-04T15:26:56+08:00

614

For simplicity, I suggest xmltodict.

It parses your XML into an OrderedDict:

>>> e = '''<foo>
...          <bar>
...              <type foobar="1"/>
...              <type foobar="2"/>
...          </bar>
...        </foo>'''

>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

>>> result['foo']

OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

>>> result['foo']['bar']

OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])

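Answered 2024-05-04T15:26:56+08:00

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website: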

library                         time    space
xml.dom.minidom (Python 2.1)    6.3 s   80000k
gnosis.objectify                2.0 s   22000k
xml.dom.minidom (Python 2.4)    1.4 s   53000k
ElementTree 1.2                 1.6 s   14500k
ElementTree 1.2.4/1.3           1.1 s   14500k
cDomlette (C extension)         0.540 s 20500k
PyRXPU (C extension)            0.175 s 10850k
libxml2 (C extension)           0.098 s 16000k
readlines (read as utf-8)       0.093 s 8850k
cElementTree (C extension)  --> 0.047 s 4900k <--
readlines (read as ascii)       0.032 s 5050k

As pointed out by @jfs, cElementTree comes bundled with Python:

Python 2: from xml.etree import cElementTree as ElementTree.
Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).
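
When memory is the main concern, one option is to stream the document with `ElementTree.iterparse` rather than building the whole tree. The sketch below is not part of the original answer: `count_elements` and `xml_bytes` are hypothetical names, and an in-memory `BytesIO` stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET

# Sample document from the question, as bytes so it can stand in for a file
xml_bytes = b"""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

def count_elements(source, tag):
    """Count elements named `tag` while streaming through `source`."""
    count = 0
    # "end" events fire once an element has been read completely
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop the finished element's children, text, and attributes
        elem.clear()
    return count

print(count_elements(io.BytesIO(xml_bytes), "type"))
```

`source` can also be a filename or an open binary file; clearing each element after it is processed keeps memory use roughly flat for large inputs.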

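Answered 2024-05-04T15:26:56+08:00

80

I find the Python xml.dom and xml.dom.minidom quite easy. Keep in mind that DOM isn't good for large amounts of XML, but if your input is fairly small then this will work fine.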
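
For a small input like the one in the question, a minimal minidom sketch might look as follows. It assumes the XML is already available as a string, for example read from a database row; `xml_text` and `values` are illustrative names.

```python
from xml.dom import minidom

# Sample document from the question
xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

# parseString builds the full DOM in memory, which is fine for small inputs
doc = minidom.parseString(xml_text)

# getElementsByTagName searches the whole document, at any depth
values = [node.getAttribute("foobar")
          for node in doc.getElementsByTagName("type")]

print(values)
```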

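Answered 2024-05-04T15:26:56+08:00
9
I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed. The ease of programming depends on the API, which ElementTree defines.

First build an Element instance e from the XML, for example with the XML function, or by parsing a file with something like: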
```
import xml.etree.ElementTree
e = xml.etree.ElementTree.parse('thefile.xml').getroot()
```
or any of the other ways shown in the ElementTree documentation. Then you can do something like:
```
for atype in e.findall('bar/type'):
    print(atype.get('foobar'))
```
and similar, usually pretty simple, code patterns.
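
For the counting task in the question, one possible sketch combines `findall` with `collections.Counter`. The list `rows` is a made-up stand-in for XML strings fetched from the database, and `count_foobar` is a hypothetical helper name.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical stand-in for XML strings read from database rows
rows = [
    '<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>',
    '<foo><bar><type foobar="1"/></bar></foo>',
]

def count_foobar(xml_rows):
    """Tally the foobar attribute values across all XML strings."""
    counts = Counter()
    for xml_text in xml_rows:
        root = ET.fromstring(xml_text)
        # 'bar/type' is relative to the root <foo> element;
        # elements without a foobar attribute are skipped
        counts.update(
            t.get("foobar")
            for t in root.findall("bar/type")
            if t.get("foobar") is not None
        )
    return counts

print(count_foobar(rows))
```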
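Answered 2024-05-04T15:26:56+08:00
5
xml.etree.ElementTree vs. lxml

These are some advantages of the two most commonly used libraries that I would have liked to know before choosing between them.

xml.etree.ElementTree:
- From the standard library: no module needs to be installed
lxml:
- Easy to write the XML declaration: for example, do you need to add standalone="no"?
- Pretty printing: you can get nicely indented XML without extra code.
- Objectify functionality: it lets you work with XML as if it were a normal Python object hierarchy.
Answered 2024-05-04T15:26:56+08:00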

lxml.objectify is really simple.

Taking your sample text:

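from collections import defaultdict
from lxml import objectify

# The sample XML from the question
text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

count = defaultdict(int)

root = objectify.fromstring(text)

# Tally each value of the foobar attribute
for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))

Output: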

{'1': 1, '2': 1}

Answered 2024-05-04T15:26:56+08:00

import xml.etree.ElementTree as ET
data = '''<foo>
           <bar>
               <type foobar="1"/>
               <type foobar="2"/>
          </bar>
       </foo>'''
tree = ET.fromstring(data)
lst = tree.findall('bar/type')
for item in lst:
    print(item.get('foobar'))

This prints the values of the foobar attribute.

Answered 2024-05-04T15:26:56+08:00
