
lxml XML parsing

<xml>
<maintag>    
<content> lorem ipsum <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>

The XML files I regularly parse may have tags inside the content tag, as shown above.

This is how I parse the file:

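from io import StringIO
from lxml import etree

# xmlFile is a string holding the XML document shown above
parser = etree.XMLParser(remove_blank_text=False)
tree = etree.parse(StringIO(xmlFile), parser)
for item in tree.iter('maintag'):
    my_content = item.find('content').text
    # print(my_content)
    # output: lorem ipsum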

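As a result, my_content is ' lorem ipsum ' instead of what I'd like to see: 'lorem ipsum dolor sit and so on'.

How can I read the content as 'lorem ipsum dolor sit and so on'?

Note: the content tag may contain tags other than strong, and it may contain no tags at all.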

1 Answer

  • 2

    The _Element.text attribute returns only the text that comes before the first child element.
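
To see why .text stops early, it may help to look at where each piece of text is stored. The sketch below is only an illustration: it parses the content element from the question on its own, and the variable names (before_child, inside_child, after_child) are made up for this example. The standard library's xml.etree.ElementTree uses the same text/tail model, so the sketch falls back to it if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree is an acceptable stand-in here,
    # since it exposes the same .text / .tail attributes.
    import xml.etree.ElementTree as etree

# The <content> element from the question, parsed on its own.
content = etree.fromstring(
    "<content> lorem ipsum <strong> dolor sit </strong> and so on </content>"
)
strong = content[0]  # the only child element

before_child = content.text  # text before <strong>
inside_child = strong.text   # text inside <strong>
after_child = strong.tail    # text after </strong>, stored on the child

print(repr(before_child), repr(inside_child), repr(after_child))
```

The text after the closing strong tag is not part of content.text at all; it lives in the tail of the strong element, which is why every fragment has to be collected explicitly.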

    Try the following:

    >>> from lxml import etree
    >>> from io import StringIO
    >>> xmlFile = '''
    ... <xml>
    ... <maintag>
    ... <content> lorem ipsum <strong> dolor sit </strong> and so on </content>
    ... </maintag>
    ... </xml>
    ... '''
    >>> parser = etree.XMLParser(remove_blank_text=False)
    >>> tree = etree.parse(StringIO(xmlFile), parser)
    >>> # //text() selects every text node below <content>, at any depth
    >>> for my_content in tree.xpath('maintag/content//text()'):
    ...     print(my_content)
    ...
     lorem ipsum
     dolor sit
     and so on
    

    Or:

    >>> # itertext() yields the same fragments in document order
    >>> for my_content in tree.find('maintag/content').itertext():
    ...     print(my_content)
    ...
     lorem ipsum
     dolor sit
     and so on
    
    
    >>> # Join the fragments into one string; strip() each to drop the padding
    >>> ' '.join(tree.find('maintag/content').itertext())
    ' lorem ipsum   dolor sit   and so on '
    >>> ' '.join(t.strip() for t in tree.find('maintag/content').itertext())
    'lorem ipsum dolor sit and so on'
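
The question also notes that content may hold tags other than strong, or none at all. A minimal sketch of a helper covering those cases follows. The function name flatten_text and the extra sample maintag entries are hypothetical, added only for illustration, and the stdlib fallback is an assumption for environments without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib ElementTree as a stand-in; it also has itertext().
    import xml.etree.ElementTree as etree


def flatten_text(element):
    """Return all text under element as one whitespace-normalized string."""
    if element is None:  # find() returns None when the tag is missing
        return ""
    # itertext() yields every text and tail fragment in document order;
    # split() followed by join() collapses each run of whitespace to one space.
    return " ".join("".join(element.itertext()).split())


# Hypothetical sample: the original entry plus a few edge cases.
xml_file = """
<xml>
<maintag><content> lorem ipsum <strong> dolor sit </strong> and so on </content></maintag>
<maintag><content>plain text only</content></maintag>
<maintag><content>a<em>b</em>c</content></maintag>
<maintag></maintag>
</xml>
"""
root = etree.fromstring(xml_file)
results = [flatten_text(item.find("content")) for item in root.iter("maintag")]
print(results)
```

Joining the fragments with an empty string before normalizing keeps a word that is split by an inline tag in one piece ('abc' rather than 'a b c'), which differs from joining the stripped fragments with a space.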
    
