Python BeautifulSoup: parsing specific text


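I'm parsing an HTML file and I want to find the part of the file that says "Smaller Reporting Company" and determine whether it has an "X" or a checkbox next to it. The checkbox is typically done with the Wingdings font or an ASCII code. In the HTML below you'll see that it has a þ next to it.

I have no problem displaying the results of a regex search for the text, but I can't get to the next step and look for the checkbox.

I'll be using this to parse many different HTML files that won't all follow the same format, but most of them will use a table and ASCII text, as in this example.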
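
As a small illustration of the checkbox encoding described above, the sketch below decodes a numeric character reference with the standard library and classifies it. Only `&#111;` (empty box) and `&#254;` (checked box) come from the sample HTML; treating `&#253;` as a crossed box is an assumption about the Wingdings font, and the names `CHECKED`, `UNCHECKED`, and `classify_glyph` are hypothetical.

```python
import html

# Wingdings glyphs after entity decoding.
# 'o' (&#111;) and 'þ' (&#254;) appear in the sample HTML;
# 'ý' (&#253;, a box with an X) is an assumption about the font.
UNCHECKED = {"o"}
CHECKED = {"\xfe", "\xfd"}


def classify_glyph(raw):
    """Return True (checked), False (unchecked), or None (unknown glyph)."""
    glyph = html.unescape(raw).strip()
    if glyph in CHECKED:
        return True
    if glyph in UNCHECKED:
        return False
    return None


print(classify_glyph("&#254;"))  # checked box in the sample
print(classify_glyph("&#111;"))  # empty box in the sample
print(classify_glyph("?"))       # anything else is reported as unknown
```

Returning `None` for unknown glyphs keeps files that use a different convention from being silently counted as unchecked.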

Here's the HTML code:

<HTML>
<HEAD><TITLE></TITLE></HEAD>
<BODY>
<DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of &#147;large accelerated filer,&#148; &#147;accelerated filer&#148; and &#147;smaller reporting company&#148;. (Check one):
</DIV>

<DIV align="center">
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<!-- Begin Table Head -->
<TR valign="bottom">
    <TD width="22%">&nbsp;</TD>
    <TD width="3%">&nbsp;</TD>
    <TD width="22%">&nbsp;</TD>
    <TD width="3%">&nbsp;</TD>
    <TD width="22%">&nbsp;</TD>
    <TD width="3%">&nbsp;</TD>
    <TD width="22%">&nbsp;</TD>
</TR>
<TR></TR>
<!-- End Table Head -->
<!-- Begin Table Body -->
<TR valign="bottom">
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT></FONT>
    </TD>
    <TD>&nbsp;</TD>
    <TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT></FONT>
    </TD>
    <TD>&nbsp;</TD>
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT> </FONT>
    <FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT>
    </TD>
    <TD>&nbsp;</TD>
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">&#254;</FONT></FONT></TD>
</TR>
<!-- End Table Body -->
</TABLE>
</DIV></BODY></HTML>

Here's my Python code:

import os, sys, string, re
from BeautifulSoup import BeautifulSoup

rawDataFile = "testfile1.html"
f = open(rawDataFile)
soup = BeautifulSoup(f)
f.close()

search = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
print search

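Question: how can I set this up so that a second search depends on the first one? That is, once I find "smaller reporting company", how can I search the next few lines to see whether there is an ASCII code? I've been going through the soup docs. I tried find and findNext but couldn't get them to work.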

3 Answers

If you know that the position of the Wingdings character won't change, you can use .next:
```
>>> nodes = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
>>> nodes[-1].next.next  # last item in list is the only good one... kinda crap
u'&#254;'
```
Or you can go up to the parent and `find` from there:
```
>>> nodes[-1].parent.find('font',style="font-family: Wingdings").next
u'&#254;'
```
Or you can do it the other way around:
```
>>> soup.findAll(text='&#254;')[0].previous.previous
u' Smaller reporting company '
```
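This assumes you know which Wingdings characters you're looking for.

The last strategy has the added bonus of filtering out the other junk that your regex is capturing. I assume you only want to work with the right list items, so you can then filter them with an `if` to your liking.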
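
For readers on Python 3, the "go up to the parent, then find" approach above can be sketched with the `bs4` package (BeautifulSoup 4). That package is an assumption here, since the original code imports the older `BeautifulSoup` module. The trimmed `SAMPLE` markup and the helper name `checkbox_for` are hypothetical; the glyph codes come from the question's HTML.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of two cells from the question's table.
SAMPLE = """
<TABLE><TR>
<TD><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT></TD>
<TD><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">&#254;</FONT></FONT></TD>
</TR></TABLE>
"""

LABEL = re.compile(r"smaller\s+reporting\s+company", re.IGNORECASE)
WINGDINGS = re.compile(r"wingdings", re.IGNORECASE)


def checkbox_for(soup, label):
    """Return the Wingdings glyph that shares a parent tag with the label text."""
    for node in soup.find_all(string=label):
        # Second search, scoped to the tag that holds the matched text.
        font = node.parent.find("font", style=WINGDINGS)
        if font is not None:
            return font.get_text(strip=True)
    return None


soup = BeautifulSoup(SAMPLE, "html.parser")
glyph = checkbox_for(soup, LABEL)
print(repr(glyph), "checked" if glyph == "\xfe" else "not checked")
```

The regex also matches the "(Do not check if a smaller reporting company)" note, but that text's parent has no Wingdings `<FONT>` inside it, so the loop skips it and moves on to the real label.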
Answered 2024-04-29T02:38:16+08:00

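You could try walking through the structure and checking the value inside the inner tag, or checking the value in the outer tag. I don't remember exactly how to do this; I ended up using lxml, but I think bsoup can probably do it.

If you can't get bsoup to do it, check out lxml. Depending on what you're doing, it may be faster. It also has hooks for using bsoup together with lxml.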
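
One way to walk the structure as described, without any third-party package, is a small `html.parser.HTMLParser` subclass from the standard library. This is a swapped-in technique, not something the answer itself used. The class name `CheckboxWalker` and the `SAMPLE` fragment are hypothetical, and the sketch assumes each label is followed by a Wingdings `<FONT>` inside the same `<TD>`.

```python
from html.parser import HTMLParser

SAMPLE = """
<TR>
<TD><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT></FONT></TD>
<TD><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">&#254;</FONT></FONT></TD>
</TR>
"""


class CheckboxWalker(HTMLParser):
    """Collect (label, glyph) pairs, one per table cell that has a glyph."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.font_stack = []   # True for each open <font> that uses Wingdings
        self.label_parts = []  # plain text seen in the current cell
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.label_parts = []
        elif tag == "font":
            style = dict(attrs).get("style") or ""
            self.font_stack.append("wingdings" in style.lower())

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.font_stack and self.font_stack[-1]:
            # Text inside a Wingdings font is the checkbox glyph.
            label = " ".join(self.label_parts)
            self.pairs.append((label, text))
        else:
            self.label_parts.append(text)


walker = CheckboxWalker()
walker.feed(SAMPLE)
walker.close()
for label, glyph in walker.pairs:
    print(label, "->", "checked" if glyph == "\xfe" else "not checked")
```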

Answered 2024-04-29T02:38:16+08:00

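lxml has a forgiving HTML parser. You don't need bsoup (which is now deprecated by its author), and you should avoid using regexes to parse HTML.

Here's a rough first cut of what you're looking for: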

guff = """\
<HTML>
<HEAD><TITLE></TITLE></HEAD>
[snip]
</DIV></BODY></HTML>
"""
from lxml.html import fromstring
doc = fromstring(guff)
for td_el in doc.iter('td'):
    font_els = list(td_el.iter('font'))
    if not font_els: continue
    print
    for el in font_els:
        print (el.text, el.attrib)

This produces:

(' Large accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})

('Accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})

(' Non-accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('(Do not check if a smaller reporting company)', {'style': 'white-space: nowrap'})

(' Smaller reporting company ', {'style': 'white-space: nowrap'})
(u'\xfe', {'style': 'font-family: Wingdings'})
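
A Python 3 take on the same loop might look like the sketch below, which turns the printed tuples into a label-to-state mapping. It assumes `lxml` is installed. The `SAMPLE` fragment is cut down from the question's HTML, and the helper name `checkbox_states` is hypothetical. Treating `þ` (`\xfe`) as checked follows the question's description.

```python
from lxml.html import fromstring

SAMPLE = """
<TABLE><TR>
<TD><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">&#111;</FONT></FONT></TD>
<TD><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">&#254;</FONT></FONT></TD>
</TR></TABLE>
"""


def checkbox_states(markup):
    """Map each cell's label text to True (checked) or False (unchecked)."""
    states = {}
    doc = fromstring(markup)
    for td_el in doc.iter("td"):
        label, glyph = None, None
        for el in td_el.iter("font"):
            text = (el.text or "").strip()
            if "wingdings" in el.get("style", "").lower():
                glyph = text
            elif text and label is None:
                # The first non-Wingdings text in the cell is the label.
                label = text
        if label is not None and glyph is not None:
            states[label] = glyph == "\xfe"
    return states


for label, checked in checkbox_states(SAMPLE).items():
    print(label, checked)
```

Keeping only the first non-Wingdings text per cell means a trailing note such as "(Do not check if a smaller reporting company)" does not overwrite the label.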

Answered 2024-04-29T02:38:16+08:00
