Replacing HTML tags with BeautifulSoup - Java 学习之路

I'm currently using BeautifulSoup to reformat some HTML pages, and I've run into a problem.

The original HTML contains things like this:

<li><p>stff</p></li>

and

<li><div><p>Stuff</p></div></li>

as well as

<li><div><p><strong>stff</strong></p></div><li>

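Using BeautifulSoup, I'd like to remove the div and p tags where they exist, but keep the strong tag.

I've been looking through the Beautiful Soup documentation and can't find anything. Any ideas?

Thanks.

4 Answers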

2

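What you want can be done with replaceWith. You have to copy the element you want to use as the replacement and then pass it as the argument to replaceWith. The documentation for replaceWith is quite clear about how to do this.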
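A minimal sketch of that approach, using bs4's `replace_with` (the current spelling of `replaceWith`). The sample markup, the loop over the list items, and the fallback to plain text when there is no strong tag are my own assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration, based on the cases in the question
html = "<ul><li><p>stff</p></li><li><div><p><strong>stff</strong></p></div></li></ul>"
soup = BeautifulSoup(html, "html.parser")

for li in soup.find_all("li"):
    # First div/p in document order, i.e. the outermost wrapper in these cases
    wrapper = li.find(["div", "p"])
    if wrapper is None:
        continue
    strong = wrapper.find("strong")
    if strong is not None:
        # Detach the element we want to keep, then use it as the replacement
        replacement = strong.extract()
    else:
        # No <strong>: fall back to the wrapper's plain text
        replacement = wrapper.get_text()
    wrapper.replace_with(replacement)

result = str(soup)
print(result)
```

This only keeps a single strong element (or the text) per list item, so it is probably too narrow for markup with several children inside the wrappers.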

Answered 2024-05-01T17:11:26+08:00

This question probably refers to an older version of BeautifulSoup, because with bs4 you can simply use the unwrap function:

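from bs4 import BeautifulSoup

# The output shown below matches the lxml parser, which adds <html><body>
s = BeautifulSoup('<li><div><p><strong>stff</strong></p></div><li>', 'lxml')
s.div.unwrap()
>> <div></div>
s.p.unwrap()
>> <p></p>
s
>> <html><body><li><strong>stff</strong></li><li></li></body></html>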
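To cover all three cases from the question in one pass, the same idea can be wrapped in a small helper. This is only a sketch: `strip_wrappers` and its `names` parameter are hypothetical names, the third sample uses a properly closed `</li>`, and `html.parser` is used so no `<html><body>` wrapper is added:

```python
from bs4 import BeautifulSoup

def strip_wrappers(html, names=("div", "p")):
    """Remove the named tags but keep everything inside them."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(list(names)):
        tag.unwrap()  # the tag goes away, its children stay in place
    return str(soup)

samples = [
    "<li><p>stff</p></li>",
    "<li><div><p>Stuff</p></div></li>",
    "<li><div><p><strong>stff</strong></p></div></li>",
]
results = [strip_wrappers(s) for s in samples]
for r in results:
    print(r)
```

Because `unwrap` leaves the children in the tree, the strong tag survives without any special handling.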

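Answered 2024-05-01T17:11:26+08:00

A simple solution is to take your whole node (here, the div) and:

Convert it to a string.
Replace <tag> with the required tag/string.
Replace the corresponding closing tag with an empty string.
Convert the modified string back into a parseable object by passing it to BeautifulSoup.

What I did in my case:

Example: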

<div class="col-md-12 option" itemprop="text">
<span class="label label-info">A</span>

**-2<sup>31</sup> to 2<sup>31</sup>-1**

sup = opt.sup
if sup:  # opt contains a <sup> tag
    # Convert the node to a string and replace the <sup> markup
    opt = str(opts).replace("<sup>", "^").replace("</sup>", "")

    # Parse the modified string again with BeautifulSoup
    s = BeautifulSoup(opt, 'lxml')

    # Reassign the required variable after the manipulation
    opts = s.find("div", class_="col-md-12 option")

Output:

-2^31 to 2^31-1
Without the manipulation it would look like this: -231 to 231-1
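For anyone who wants to run this, here is a self-contained sketch of the same string round trip. The closing `</div>` is assumed (the snippet above is cut off), `html.parser` is swapped in for `lxml` to avoid the extra dependency, and dropping the label span before reading the text is my own addition:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="col-md-12 option" itemprop="text">'
    '<span class="label label-info">A</span>'
    '-2<sup>31</sup> to 2<sup>31</sup>-1'
    '</div>'  # closing tag assumed; the original snippet is cut off
)
opts = BeautifulSoup(html, "html.parser").find("div", class_="col-md-12 option")

if opts.sup:  # only round-trip when a <sup> tag is present
    # Serialize the node, swap the markup, and parse the result again
    markup = str(opts).replace("<sup>", "^").replace("</sup>", "")
    opts = BeautifulSoup(markup, "html.parser").find("div", class_="col-md-12 option")

opts.span.extract()               # drop the "A" label (assumed to be unwanted here)
text = opts.get_text(strip=True)
print(text)
```

Reparsing the whole node is simple but throws away any references held to the old elements, so it probably suits one-off cleanup better than code that keeps navigating the same tree.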

Answered 2024-05-01T17:11:26+08:00

You can write your own function to strip the tags:

import re

def strip_tags(string):
    return re.sub(r'<.*?>', '', string)

strip_tags("<li><div><p><strong>stff</strong></p></div><li>")
'stff'
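Note that the function above removes every tag, including strong. A possible variant that strips only the div and p tags is sketched below; `WRAPPER_TAG` and `strip_wrapper_tags` are hypothetical names, and a regex like this is only a rough tool that can misfire on unusual markup (for example, a `>` inside an attribute value):

```python
import re

# Matches opening and closing <div> and <p> tags, with or without attributes.
# The \b keeps tags such as <pre> from being treated as <p>.
WRAPPER_TAG = re.compile(r"</?(?:div|p)\b[^>]*>", re.IGNORECASE)

def strip_wrapper_tags(markup):
    """Remove div and p tags from the string, leaving all other markup alone."""
    return WRAPPER_TAG.sub("", markup)

cleaned = strip_wrapper_tags("<li><div><p><strong>stff</strong></p></div></li>")
print(cleaned)
```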

Answered 2024-05-01T17:11:26+08:00
