BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

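I'm trying to use BeautifulSoup to strip all the inner HTML from the <p> elements in a web page. There are inner tags, but I don't care about them; I just want the inner text.

For example, for: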

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

how can I extract:

Red
Blue
Yellow
Light green

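Neither .string nor .contents[0] does what I need. Neither does .extract(), because I don't want to have to specify the inner tags in advance - I want to handle whatever may occur.

Is there a 'just get the visible HTML' type of method in BeautifulSoup?

----UPDATE------

Following the advice, I tried: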

soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags): 
    print str(i) + p_tag

But that doesn't help - it prints out:

0Red
1

2Blue
3

4Yellow
5

6Light 
7green
8

4 Answers

First, convert the HTML to a string using str. Then use the following code in your program:
```
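import re
# soup is an existing BeautifulSoup object. Convert each <p> to a string separately
# so the brackets and commas of the list repr don't end up in the output.
content = [re.sub(r"<.*?>", "", str(p)) for p in soup.find_all("p")]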
```
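This is called a regex. This one removes everything from a < up to the next > (brackets included), that is, the tags themselves, and leaves the text between the tags in place.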
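
As a rough illustration of the regex idea on the sample markup from the question, here is a minimal sketch that uses only the standard library. The variable names (html, paragraphs, texts) are made up for this example, and it assumes simple, well-formed markup with plain <p> tags; a regex like this can misfire on attributes that contain >, on comments, or on scripts, so a real parser is usually the safer choice.

```python
import re
from html import unescape

html = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

# Capture the inside of each <p>...</p> pair (assumes attribute-free <p> tags).
paragraphs = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)

# Delete anything that looks like a tag, then decode entities such as &amp;.
texts = [unescape(re.sub(r"<[^>]*>", "", p)) for p in paragraphs]

for text in texts:
    print(text)
```

Processing each paragraph on its own keeps the results separate instead of producing one long string.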
Answered on 2024-04-29T17:58:44+08:00

Short answer: soup.findAll(text=True)

This has already been answered here on StackOverflow and in the BeautifulSoup documentation.

UPDATE:

To clarify, a working piece of code:

>>> txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
    print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green

Answered on 2024-04-29T17:58:44+08:00

The accepted answer is great, but it is 6 years old now, so here's the current Beautiful Soup 4 version of this answer:

>>> txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Red
Blue
Yellow
Light green
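
Joining soup.strings yields the text of the whole document as one string. If one result per paragraph is wanted, as in the question, a possible sketch is to call get_text() on each <p> tag. This assumes Beautiful Soup 4 is installed as bs4; the name texts is only illustrative.

```python
from bs4 import BeautifulSoup

txt = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(txt, "html.parser")

# get_text() concatenates every string nested under the tag,
# however many inner tags there are.
texts = [p.get_text() for p in soup.find_all("p")]

for text in texts:
    print(text)
```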

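Answered on 2024-04-29T17:58:44+08:00

Usually, data scraped from a website will contain tags. To leave out the tags and show only the text content, you can use the text attribute.

For example,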

from BeautifulSoup import BeautifulSoup
import urllib2

url = urllib2.urlopen("https://www.python.org")
content = url.read()
soup = BeautifulSoup(content)

title = soup.findAll("title")
paragraphs = soup.findAll("p")

print paragraphs[1]       # Second paragraph with tags
print paragraphs[1].text  # Second paragraph without tags

In this example, I collect all the paragraphs from the Python site and display one of them both with tags and without tags.
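
For comparison, here is a small offline sketch of the same idea in Beautiful Soup 4 (assumed to be installed as bs4). The markup and variable names are invented for illustration. In bs4, .text is a property that returns the same value as get_text(), and get_text() also accepts a separator and a strip flag, which can help when the source HTML has stray whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra whitespace around the text.
html = "<p>\n  Light <b>green</b>\n</p>"

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

raw = p.text                          # same result as p.get_text()
clean = p.get_text(" ", strip=True)   # strip each piece, join pieces with a space

print(repr(raw))
print(repr(clean))
```

With strip=True, each text fragment is stripped and empty fragments are dropped before joining, so the separator decides how neighbouring fragments are glued together.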

Answered on 2024-04-29T17:58:44+08:00
