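I'm trying to use BeautifulSoup to get some text that isn't wrapped in a tag of its own. I tried .string, .contents, .text, .find(text=True), and .next_sibling; the results are listed below.
Edit: Never mind, I just noticed that .next_sibling works for me. In any case, this question can serve as a collection of notes on ways to handle similar cases.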
from bs4 import BeautifulSoup
s = """
<p>
<a>
Something I can fetch but don't want
</a>
I want to fetch this line.
<a>
Something else I can fetch but don't want
</a>
</p>
"""
p = BeautifulSoup(s, 'html.parser')
print p.contents
# [u'\n', <p>
# <a>
# Something
# </a>
# I want to fetch this line.
# <a>
# Something else
# </a>
# </p>, u'\n']
print p.a.next_sibling.string  # the text node right after the first <a>
# I want to fetch this line.
print p.string
# None
print p.text
# all the texts, including those I can get but don't want.
print p.find(text=True)
# Returns an empty line of type bs4.element.NavigableString
print p.find(text=True)[0]
# Returns an empty line of type unicode
Is there a simpler way to get the line I want than manually parsing the string s?
1 Answer
Try this. It's still rough, but at least it doesn't require you to parse the string manually.
This produces:
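For comparison, here is a minimal sketch of one way to collect only the text that sits directly inside `<p>`, skipping the text inside the `<a>` tags. It is an illustration, not necessarily the code this answer originally used. The helper name `direct_text` is hypothetical, and the sketch assumes the same `html.parser` backend and markup as the question.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<p>
<a>
Something I can fetch but don't want
</a>
I want to fetch this line.
<a>
Something else I can fetch but don't want
</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Return the stripped, non-empty text nodes that are direct children of tag.

    Hypothetical helper: text nested inside child tags (here, the <a> tags)
    is ignored because only tag.children is inspected, not descendants.
    """
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString) and child.strip()
    ]


lines = direct_text(soup.p)
print(lines)
```

With the markup from the question, `lines` contains the single string `I want to fetch this line.`. The same text node is also reachable as `soup.p.a.next_sibling`, which matches the .next_sibling approach mentioned in the question's edit; the helper is just a more general form for cases with several loose text nodes.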