
Using BeautifulSoup to get text that is not inside a tag

I am trying to use BeautifulSoup to get some text that is not wrapped in a tag of its own. I have tried .string, .contents, .text, .find(text=True) and .next_sibling; the results are listed below.

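Edit: Never mind, I just noticed that .next_sibling works for me. In any case, this question can serve as a place to collect approaches for handling similar cases.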

from bs4 import BeautifulSoup

s = """
<p>
    <a>
        Something I can fetch but don't want
    </a> 
    I want to fetch this line.
    <a>
        Something else I can fetch but don't want
    </a>
</p>
"""

p = BeautifulSoup(s, 'html.parser')
print(p.contents)
    # ['\n', <p>
    # <a>
    #     Something
    # </a> 
    #     I want to fetch this line.
    # <a>
    #     Something else
    # </a>
    # </p>, '\n']

# The soup object itself has no sibling, so navigate from the first <a>.
print(p.a.next_sibling.strip())
    # I want to fetch this line.
print(p.string)
    # None
print(p.text)
    # all the texts, including those I can get but don't want.
print(p.find(text=True))
    # Returns an empty line of type bs4.element.NavigableString
print(p.find(text=True)[0])
    # Returns an empty line of type str

Is there a simpler way to get the line I want than parsing the string s by hand?

1 Answer

  • 2

    Try this. It is still rough, but at least it does not require you to parse the string by hand.

    # Collect every non-empty string in the parsed document.
    texts = [x.strip() for x in p.strings if x.strip()]
    
    # Collect the text of every tag, i.e. the text we do not want.
    unwanted_text = [x.text.strip() for x in p.find_all()]
    
    # Take the difference: what remains is text not wrapped in its own tag.
    set(texts).difference(unwanted_text)
    

    This produces:

    In [87]: set(texts).difference(unwanted_text)
    Out[87]: {'I want to fetch this line.'}
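
The set difference above discards ordering, and it would also drop a direct string that happens to equal the full text of some tag. A possible alternative, offered only as a sketch, is to walk the immediate children of the enclosing tag and keep the ones that are plain strings. The helper name `direct_text` and the variable names below are hypothetical, not part of BeautifulSoup or of the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """
<p>
    <a>
        Something I can fetch but don't want
    </a>
    I want to fetch this line.
    <a>
        Something else I can fetch but don't want
    </a>
</p>
"""


def direct_text(tag):
    """Return the stripped, non-empty strings that are direct children of `tag`.

    Hypothetical helper for illustration. Note that Comment and CData are
    NavigableString subclasses, so they would be included as well.
    """
    pieces = []
    for child in tag.children:
        # Keep only immediate children that are strings, skipping nested tags.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:
                pieces.append(stripped)
    return pieces


soup = BeautifulSoup(html, "html.parser")
own_text = direct_text(soup.p)
print(own_text)  # ['I want to fetch this line.']
```

Because the result is a list built in document order, repeated strings and their positions are preserved, which may matter when a tag holds several loose text fragments.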
    
