Using BeautifulSoup to extract text between line breaks (e.g. <br /> tags)

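I have the following HTML, which is part of a larger document:

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

I'm currently using BeautifulSoup to get other elements from the HTML, but I haven't found a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each <br /> element, but I can't find a way to get the text in between. Any help would be greatly appreciated. Thanks.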

3 Answers

  • 4

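    If you just want any text that sits between two <br /> tags, you could do something like the following:

    # BeautifulSoup 3 / Python 2
    from BeautifulSoup import BeautifulSoup, NavigableString, Tag

    input = '''<br />
    Important Text 1
    <br />
    <br />
    Not Important Text
    <br />
    Important Text 2
    <br />
    Important Text 3
    <br />
    <br />
    Non Important Text
    <br />
    Important Text 4
    <br />'''

    soup = BeautifulSoup(input)

    for br in soup.findAll('br'):
        # The node right after this <br /> must be a text node
        next_s = br.nextSibling
        if not (next_s and isinstance(next_s, NavigableString)):
            continue
        # ...and the node after that must be another <br />
        next2_s = next_s.nextSibling
        if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
            text = str(next_s).strip()
            if text:
                print "Found:", next_s

    But perhaps I misunderstood your question? Your description of the problem doesn't seem to match the "important" / "non important" labels in your example data, so I went with the description ;)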

  • 0

    So, for testing purposes, let's assume this chunk of HTML is inside a span tag:

    x = """<span><br />
    Important Text 1
    <br />
    <br />
    Not Important Text
    <br />
    Important Text 2
    <br />
    Important Text 3
    <br />
    <br />
    Non Important Text
    <br />
    Important Text 4
    <br /></span>"""

    Now I'm going to parse it and find my span tag:

    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(x)
    y = soup.find('span')

    If you iterate over the generator returned by y.childGenerator(), you get both the br tags and the text:

    In [4]: for a in y.childGenerator(): print type(a), str(a)
       ....: 
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> 
    Important Text 1

    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> 

    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> 
    Not Important Text

    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> 
    Important Text 2

    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> 
    Important Text 3

    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> 

    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> 
    Non Important Text

    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> 
    Important Text 4

    <type 'instance'> <br />

  • 21

    The following worked for me:

    for br in soup.findAll('br'):
        if str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
           print br.contents[0]
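
The answers above target BeautifulSoup 3 on Python 2. As a rough sketch, the sibling-walking idea from the first answer (a text node whose neighbours on both sides are <br /> tags) might look like this on Python 3. It assumes the bs4 package (BeautifulSoup 4), which none of the answers use; the names HTML and text_between_brs are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup from the question (assumed to be the full fragment).
HTML = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />"""


def text_between_brs(html):
    """Return every non-empty text node that sits directly between two <br> tags."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for br in soup.find_all("br"):
        node = br.next_sibling
        # The node right after this <br> must be text, not another tag or None.
        if not isinstance(node, NavigableString):
            continue
        after = node.next_sibling
        # The text must be followed by another <br>.
        if isinstance(after, Tag) and after.name == "br":
            text = node.strip()
            if text:  # skip whitespace-only gaps between consecutive <br> tags
                found.append(text)
    return found


for line in text_between_brs(HTML):
    print("Found:", line)
```

Like the first answer, this returns all six lines of text, since each of them sits between two <br /> tags; deciding which ones count as "important" still needs a rule of your own.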
    

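Building on the child iteration shown in the second answer, one possible way to make the structure usable is to collect the text nodes into groups and start a new group wherever two <br /> tags appear back to back. This is only a sketch under assumptions: it uses bs4 on Python 3 rather than BeautifulSoup 3, it relies on there being a whitespace-only text node between consecutive <br /> tags (true for the sample, not for `<br/><br/>` written without a gap), and SAMPLE and group_by_double_br are hypothetical names.

```python
from bs4 import BeautifulSoup, NavigableString

# The span-wrapped sample from the second answer.
SAMPLE = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""


def group_by_double_br(container):
    """Group a tag's direct text children, splitting where two <br> tags are adjacent."""
    groups, current = [], []
    for child in container.children:
        if not isinstance(child, NavigableString):
            continue  # skip the <br> tags themselves
        text = child.strip()
        if text:
            current.append(text)
        elif current:
            # Whitespace-only node: it lies between two consecutive <br> tags,
            # so close the current group.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


span = BeautifulSoup(SAMPLE, "html.parser").find("span")
for group in group_by_double_br(span):
    print(group)
```

With the sample this yields three groups, which makes it possible to pick lines by their position within a group if that is what distinguishes the important ones.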