
Extracting text between tags using BeautifulSoup


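I am trying to use BeautifulSoup to extract text from a series of web pages that all follow a similar format. The HTML for the text I want to extract is shown below. The actual page is here: http://www.p2016.org/ads1/bushad120215.html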

<p><span style="color: rgb(153, 153, 153);"></span><font size="-1">      <span
 style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font><span style="color: rgb(153, 153, 153);"></span><font size="-1"><span style="font-family: Arial;"><big><span
 style="color: rgb(153, 153, 153);"></span></big></span></font><font
 size="-1"><span style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font><font size="-1"><span style="font-family: Arial;"><big><span style="color: rgb(153, 153, 153);"></span></big></span></font></p>   <p><span style="color: rgb(153, 153, 153);">[Music]</span><span
 style="text-decoration: underline;"><br>
</span></p>
<p><small><span style="text-decoration: underline;">TEXT</span>: The
Medal of Honor is the highest award for valor in action against an
enemy force</small><span style="text-decoration: underline;"><br>
</span></p>
<p><span style="text-decoration: underline;">Col. Jay Vargas</span>:&nbsp;
We
were
completely
surrounded,
116 Marines locking heads with 15,000
North Vietnamese.&nbsp; Forty hours with no sleep, fighting hand to
hand.<span style="text-decoration: underline;"><br>
<span style="font-family: helvetica,sans-serif;"><br>
</span>

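I would like to find a way to iterate through all of the HTML files in my folder and extract the text between all of the tags. Here is the relevant portion of my code: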

text=[]

for page in pages:
        html_doc = codecs.open(page, 'r')
        soup = BeautifulSoup(html_doc, 'html.parser')
        for t in soup.find_all('<p>'):
            t = t.get_text()
            text.append(t.encode('utf-8'))
            print t

However, nothing shows up. Apologies for the noob question, and thanks in advance for your help.

1 Answer

  • 2

    for t in soup.find_all('<p>'):

    The problem is the line above. Pass just the tag name, not its markup representation with angle brackets:

    for t in soup.find_all('p'):
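To make the difference concrete, here is a minimal sketch run against a short HTML sample modeled on the question's markup. The sample string and the variable names `wrong` and `paragraphs` are assumptions made for illustration; whitespace is collapsed so the line breaks inside the source HTML do not leak into the result.

```python
from bs4 import BeautifulSoup

# Short sample modeled on the question's markup (an assumption for illustration).
html = """
<p><span style="color: rgb(153, 153, 153);">[Music]</span></p>
<p><small><span style="text-decoration: underline;">TEXT</span>: The
Medal of Honor is the highest award for valor</small></p>
"""

soup = BeautifulSoup(html, "html.parser")

# "<p>" is treated as a literal tag name, which no element has,
# so the result is empty and the loop body never runs.
wrong = soup.find_all("<p>")

# "p" is the actual tag name, so both paragraphs match.
# " ".join(s.split()) collapses newlines and repeated spaces.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]

print(len(wrong))
print(paragraphs)
```

This also explains why the original loop printed nothing rather than raising an error: `find_all` simply returns an empty result when no tag matches.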
    

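    Here is how you could narrow the search down to the dialogue paragraphs, by finding each underlined speaker label and reading the text node that follows it: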

    for span in soup.find_all("span", style="text-decoration: underline;"):
        text = span.next_sibling
    
        if text:
            print(span.text, text.strip())
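One caveat with the sibling approach: `next_sibling` can be another tag rather than a text node, and on this page some underlined spans are empty apart from a `<br>`. The following is a hedged sketch, not the answerer's code, that skips those cases. The sample HTML is shortened from the question, and the names `lines`, `speaker`, and `spoken` are introduced here for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample based on the question's markup (an assumption for illustration).
html = """
<p><span style="text-decoration: underline;">Col. Jay Vargas</span>:&nbsp;
We were completely surrounded.<span style="text-decoration: underline;"><br>
</span></p>
"""

soup = BeautifulSoup(html, "html.parser")

lines = []
for span in soup.find_all("span", style="text-decoration: underline;"):
    speaker = span.get_text(strip=True)
    sibling = span.next_sibling
    # Skip spans without a speaker label, and siblings that are tags or missing.
    if not speaker or not isinstance(sibling, NavigableString):
        continue
    # Collapse whitespace (including non-breaking spaces) and drop the leading colon.
    spoken = " ".join(sibling.split()).lstrip(": ")
    if spoken:
        lines.append((speaker, spoken))

print(lines)
```

Collecting `(speaker, spoken)` pairs in a list rather than printing inside the loop makes it easier to accumulate results across many files.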
    
