The most efficient way to extract data from XML using lxml

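I have the following snippet from a large XML file. I want to extract the elements that belong to a specific namespace, for example xmlns:dc="http://purl.org/dc/elements/1.1/". Currently I can do this as follows: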

from lxml import etree

tree = etree.parse(file)
for element in tree.iter('{http://www.openarchives.org/OAI/2.0/}record'):
    for leaf in element.iter('{http://purl.org/dc/elements/1.1/}subject'):
        print(leaf)

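The problem is that I want to get the data for several tags in the {http://purl.org/dc/elements/1.1/} namespace. I would also like to simplify things and have been looking into how to use XPath, but I can't seem to figure it out. Can I use XPath, and if so, more importantly, is it a better fit for my goal?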

Here is the XML:

<?xml version="1.0" encoding="UTF-8" ?>



<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
 http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2013-08-15T23:24:55Z</responseDate>
<request verb="ListRecords" resumptionToken="0/500/121403/nsdl_dc/null/null/null">http://nsdldev.org/oai</request>

<!-- Showing records 501 through 1000 out of 121403 total  -->

<ListRecords>


  <record>
    <header>
      <identifier>oai:nsdl.org:2200/20110926115158975T</identifier>
      <datestamp>2013-05-29T16:44:49Z</datestamp>
       <setSpec>ncs-NSDL-COLLECTION-000-003-112-056</setSpec>
      </header>
    <metadata>
    <nsdl_dc:nsdl_dc xmlns:nsdl_dc="http://ns.nsdl.org/nsdl_dc_v1.02/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:dct="http://purl.org/dc/terms/"
                 xmlns:lar="http://ns.nsdl.org/schemas/dc/lar"
                 xmlns:ieee="http://www.ieee.org/xsd/LOMv1p0"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 schemaVersion="1.02.020"
                 xsi:schemaLocation="http://ns.nsdl.org/nsdl_dc_v1.02/ http://ns.nsdl.org/schemas/nsdl_dc/nsdl_dc_v1.02.xsd">
   <lar:readiness xsi:type="lar:Ready">Fully ready</lar:readiness>
   <dc:identifier xsi:type="dct:URI">http://www.exo.net/~emuller/activities/Hot%20Sauce%20Hot%20Spots.pdf</dc:identifier>
   <dc:relation xsi:type="nsdl_dc:NSDLPartnerURL">http://howtosmile.org/record/4427</dc:relation>
   <dc:title>Hot Sauce Hot Spots</dc:title>
   <dc:description>In this activity, learners model hot spot island formation, orientation and progression with condiments. Learners squirt a thick condiment sauce on a coarsely woven fabric to model how volcanic island hot spots form.</dc:description>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Oceanography</dc:subject>
   <dc:subject>Earth system science</dc:subject>
   <dc:subject>Geoscience</dc:subject>
   <dc:subject>Anthropology</dc:subject>
   <dc:subject>Physical science</dc:subject>
   <dc:subject>Physics</dc:subject>
   <dc:subject>General science</dc:subject>
   <dc:subject>hot spot island</dc:subject>
   <dc:subject>volcano</dc:subject>
   <dc:subject>tectonic plates</dc:subject>
   <dc:subject>Earth</dc:subject>
   <dc:subject>molten</dc:subject>
   <dc:subject>magma</dc:subject>
   <dc:subject>eruption</dc:subject>
   <dc:subject>undersea</dc:subject>
   <dc:subject>ocean</dc:subject>
   <dc:subject>island</dc:subject>
   <dc:subject>Earth Processes</dc:subject>
   <dc:subject>Volcanoes and Plate Tectonics</dc:subject>
   <dc:subject>Earth Structure</dc:subject>
   <dc:subject>Rocks and Minerals</dc:subject>
   <dc:subject>Oceans and Water</dc:subject>
   <dc:subject>Geologic Time</dc:subject>
   <dc:subject>Heat and Temperature</dc:subject>
   <dc:subject>Conducting Investigations</dc:subject>
   <dc:language>en-US</dc:language>
   <dc:format>application/pdf</dc:format>
   <lar:accessMode xsi:type="lar:ModeAcc">visual</lar:accessMode>
   <lar:accessMode xsi:type="lar:ModeAcc">tactile</lar:accessMode>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Upper Elementary</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Middle School</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">High School</dct:educationLevel>
   <dct:educationLevel xsi:type="nsdl_dc:NSDLEdLevel">Informal Education</dct:educationLevel>
   <dct:audience xsi:type="nsdl_dc:NSDLAudience">Learner</dct:audience>
   <dc:type xsi:type="nsdl_dc:NSDLType">Activity</dc:type>
   <dc:type xsi:type="nsdl_dc:NSDLType">Model</dc:type>
   <dct:isPartOf>http://www.exo.net/~emuller/activities/index.html</dct:isPartOf>
   <dc:date xsi:type="dct:W3CDTF">2007</dc:date>
   <dc:creator>Eric Muller</dc:creator>
   <dc:contributor>The Exploratorium</dc:contributor>
   <dct:accessRights xsi:type="nsdl_dc:NSDLAccess">Free access</dct:accessRights>
   <dc:rights>Copyright 2007 Do Science</dc:rights>
   <dct:license>Owner license</dct:license>
   <lar:licenseProperty xsi:type="lar:LicProp">Terms of use unknown</lar:licenseProperty>
   <dct:rightsHolder>Do Science</dct:rightsHolder>
   <lar:metadataTerms>The following entity, University Corporation for Atmospheric Research (UCAR), has claims on the use of this metadata. This claim is as follows: The National Science Digital Library (NSDL), located at the University Corporation for Atmospheric Research (UCAR), provides these metadata terms: These data and metadata may not be reproduced, duplicated, copied, sold, or otherwise exploited for any commercial purpose that is not expressly permitted by NSDL. The entity provided more information at: http://nsdl.org/help/terms-of-use</lar:metadataTerms>
   <lar:metadataTerms>The National Science Digital Library (NSDL), located at the University Corporation for Atmospheric Research (UCAR), provides these metadata terms: These data and metadata may not be reproduced, duplicated, copied, sold, or otherwise exploited for any commercial purpose that is not expressly permitted by NSDL. More information is available at: http://nsdl.org/help/terms-of-use.</lar:metadataTerms>
</nsdl_dc:nsdl_dc>

    </metadata>
  </record>

2 Answers

  • 2

    It's not clear exactly what you want to access, but try something like this:

    from lxml import etree

    doc = etree.parse(xmlfile)
    ns = {'dc': 'http://purl.org/dc/elements/1.1/',
          'oai': 'http://www.openarchives.org/OAI/2.0/'}
    doc.xpath('//dc:subject', namespaces=ns)  # all dc:subject elements
    doc.xpath('//dc:*', namespaces=ns)  # all elements in the dc: namespace
    # a more specific path
    doc.xpath('/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*/dc:*', namespaces=ns)
    # the oai: prefix needs the namespace map here as well
    x = doc.xpath('/oai:OAI-PMH/oai:ListRecords/oai:record/oai:metadata/*', namespaces=ns)
    x[0].xpath('*[contains(.,"Geo")]')  # xpath() can also be called on elements, not only the document
    x[0].xpath('dc:subject/text()', namespaces=ns)  # text of the dc:subject elements
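
To get several dc: tags per record in one pass, one option is to select `dc:*` relative to each record and group the results by local tag name. The following is a minimal sketch under stated assumptions, not a definitive implementation: the inline `SAMPLE` document, its `wrapper` element, and the `dc_fields` helper are hypothetical stand-ins for the real file and are not part of the original question.

```python
from collections import defaultdict

from lxml import etree

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "oai": "http://www.openarchives.org/OAI/2.0/",
}

# Hypothetical, heavily shortened stand-in for the OAI-PMH response.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <wrapper xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:dct="http://purl.org/dc/terms/">
          <dc:title>Hot Sauce Hot Spots</dc:title>
          <dc:subject>Geoscience</dc:subject>
          <dc:subject>Oceanography</dc:subject>
          <dct:license>Owner license</dct:license>
        </wrapper>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def dc_fields(record):
    """Group the text of every dc:* descendant of one record by local tag name."""
    fields = defaultdict(list)
    for el in record.xpath(".//dc:*", namespaces=NS):
        # QName strips the '{namespace}' part, leaving e.g. 'subject'.
        fields[etree.QName(el).localname].append(el.text)
    return dict(fields)


root = etree.fromstring(SAMPLE)
records = [dc_fields(rec) for rec in root.xpath("//oai:record", namespaces=NS)]
print(records)
```

Elements from other namespaces, such as `dct:license` in the sample, are skipped because the name test `dc:*` only matches the dc: namespace.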
    

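    Also read some XPath documentation beyond the Python or lxml docs. Those tell you how to use XPath from Python, but they are not really an XPath tutorial.

    Note that the find() and findall() methods take ElementPath expressions, which are a limited, XPath-like subset of XPath.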
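
For comparison, the ElementPath route can be sketched as below. This is only an illustration under assumptions: the inline `SAMPLE` document, its `wrapper` element, and the `dc_values` helper are hypothetical. The import fallback is there because `findall()` with a `namespaces` mapping is also available in the standard library's `xml.etree.ElementTree`.

```python
try:
    from lxml import etree
except ImportError:  # the same ElementPath calls exist in the standard library
    import xml.etree.ElementTree as etree

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "oai": "http://www.openarchives.org/OAI/2.0/",
}

# Hypothetical, heavily shortened stand-in for the OAI-PMH response.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <wrapper xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hot Sauce Hot Spots</dc:title>
          <dc:subject>Geoscience</dc:subject>
          <dc:subject>Oceanography</dc:subject>
        </wrapper>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def dc_values(record, names):
    """Return {name: [text, ...]} for the requested dc: tags below one record."""
    return {
        name: [el.text for el in record.findall(".//dc:" + name, namespaces=NS)]
        for name in names
    }


root = etree.fromstring(SAMPLE)
records = [
    dc_values(rec, ("title", "subject"))
    for rec in root.iter("{%s}record" % NS["oai"])
]
print(records)
```

ElementPath has no predicates such as `contains()` and no `text()` step, so anything beyond simple child and descendant paths still needs `xpath()`.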

  • 0
    for element in tree.findall(".//{http://purl.org/dc/elements/1.1/}subject"):
        print(element)
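
Since the question mentions a large file, a streaming parse may be worth considering so that the whole tree never has to sit in memory at once. The sketch below uses `iterparse` and clears each record after reading it. It swaps in a different technique from the `findall()` call above, and the inline `SAMPLE` document, its `wrapper` element, and the `iter_subjects` helper are hypothetical.

```python
import io

try:
    from lxml import etree
except ImportError:  # iterparse with 'end' events also exists in the standard library
    import xml.etree.ElementTree as etree

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical, heavily shortened stand-in for the OAI-PMH response.
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <wrapper xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:subject>Geoscience</dc:subject>
          <dc:subject>Oceanography</dc:subject>
        </wrapper>
      </metadata>
    </record>
    <record>
      <metadata>
        <wrapper xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:subject>Physics</dc:subject>
        </wrapper>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def iter_subjects(source):
    """Yield one list of dc:subject texts per record while streaming the input."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == OAI + "record":
            yield [s.text for s in elem.iter(DC + "subject")]
            # The record has been read, so its children can be released.
            elem.clear()


subjects = list(iter_subjects(io.BytesIO(SAMPLE)))
print(subjects)
```

With a real file, pass the file name or an open binary file instead of the `BytesIO` object. Whether this is actually faster than a full parse depends on the file size and on how much of each record is needed, so it is worth measuring on the real data.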
    
