I am scraping text data from a table like the one below, and I would like to get this result:
Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
html = '''
<table>
<tr class="title last ">
<td>
Lorem ipsum
</td>
<td>
</td>
</tr>
<tr>
<td>
<span class="caps">dolor
</span>
sit amet
</td>
<td>
</td>
</tr>
<tr>
<td>
consectetur adipiscing elit,
</td>
<td>
</td>
</tr>
<tr>
<td>
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</td>
<td>
</td>
</tr>
<tr>
<td>
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</td>
<td>
</td>
</tr>
</table>
'''
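I use beautifulsoup4 to unwrap the <span> element:
from bs4 import BeautifulSoup

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()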
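However, I either get empty lines for all the empty <td> elements, or the 'dolor sit amet' line is not printed, even though I can see it when I pretty-print the HTML.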
import re

# text with empty lines
for line in soup.find_all('td'):
    print(line.get_text().strip())
    print(line.string)  # line with <span> prints None
# missing line <span>
for line in soup.find_all('td', text=re.compile(r'\w')):
    print(line.get_text().strip())
print(soup.prettify())
Am I doing something wrong? How can I use unwrap() and still access all the text content without the empty lines?
Thanks for your help!
1 Answer
I tested it, and you are close. Apply strip(), then use the re module to replace runs of whitespace with a single space.
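The answer's original snippet and output are not preserved here, so what follows is only a minimal sketch of the idea it describes, not the answerer's actual code. The helper names `normalize_whitespace` and `clean_cells` are hypothetical, and the sample strings are typed by hand to resemble what `get_text()` would return for the first cells of the table in the question. Since `get_text()` already concatenates the text of nested tags, this approach should not depend on calling `unwrap()` first.

```python
import re


def normalize_whitespace(text):
    """Strip the ends, then collapse each run of whitespace into one space."""
    return re.sub(r"\s+", " ", text.strip())


def clean_cells(raw_texts):
    """Normalize every cell's text and drop the cells that end up empty."""
    cleaned = (normalize_whitespace(t) for t in raw_texts)
    return [t for t in cleaned if t]


# Hand-written sample data (an assumption, not parsed output): strings shaped
# like td.get_text() for the first two rows of the table in the question.
raw = ["\nLorem ipsum\n", "\n", "\ndolor\n\nsit amet\n", "\n"]

for line in clean_cells(raw):
    print(line)
```

With the `soup` object from the question, the same helper could presumably be fed directly from the parsed cells, for example `clean_cells(td.get_text() for td in soup.find_all('td'))`. Filtering on the cleaned text, rather than passing `text=re.compile(r'\w')` to `find_all`, avoids relying on `.string`, which is `None` for a cell that holds more than one child node.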