Getting text without tags using BeautifulSoup

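I am trying to use BeautifulSoup to get some text that is not wrapped in a tag. I tried .string, .contents, .text, .find(text=True) and .next_sibling, all of which are listed below.

Edit: Never mind, I just noticed that .next_sibling works for me. In any case, this question can serve as a collection of notes on ways to handle similar cases.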

from bs4 import BeautifulSoup

s = """
<p>
    <a>
        Something I can fetch but don't want
    </a> 
    I want to fetch this line.
    <a>
        Something else I can fetch but don't want
    </a>
</p>
"""

p = BeautifulSoup(s, 'html.parser')
print(p.contents)
    # ['\n', <p>
    # <a>
    #     Something
    # </a> 
    #     I want to fetch this line.
    # <a>
    #     Something else
    # </a>
    # </p>, '\n']   (output abridged)

print(p.a.next_sibling.strip())
    # I want to fetch this line.
    # (.next_sibling is taken from the first <a>; on the soup object itself it is None)
print(p.string)
    # None
print(p.text)
    # All the text, including the parts I can get but don't want.
print(p.find(text=True))
    # Prints an empty line; the value is a bs4.element.NavigableString
print(p.find(text=True)[0])
    # Prints an empty line; the value is a str (unicode in Python 2)

Is there a simpler way to get the line I want than parsing the string s by hand?

1 Answer

Try this. It is still rough, but at least it does not require you to parse the string by hand.

# Get every non-empty string in the document, with whitespace stripped.
texts = [x.strip() for x in p.strings if x.strip()]

# Get the stripped text of every tag (text that is wrapped in some tag).
unwanted_text = [x.text.strip() for x in p.find_all()]

# Take the difference: strings that are not the full text of any tag.
set(texts).difference(unwanted_text)

This produces:

In [87]: set(texts).difference(unwanted_text)
Out[87]: {'I want to fetch this line.'}
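
The set difference above loses ordering and would also drop a wanted line that happens to match some tag's text. As a rough alternative sketch, you can keep only the text nodes that are direct children of the <p> tag. The names html, soup and direct_text are made up for this illustration, and the markup is assumed to be the same as in the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Same markup as in the question (assumed for this illustration).
html = """
<p>
    <a>
        Something I can fetch but don't want
    </a>
    I want to fetch this line.
    <a>
        Something else I can fetch but don't want
    </a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Return the non-empty text nodes that are direct children of `tag`.

    Hypothetical helper: text nested inside child tags (the <a> elements
    here) is skipped, because only immediate children are inspected.
    """
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString) and child.strip()
    ]


result = direct_text(soup.p)
print(result)
```

Because this walks tag.children in document order, the result stays a list, so order and repeated lines are preserved.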

Answered 2024-04-30T00:07:44+08:00
