
Unwrapping elements with beautifulsoup4: does it affect the parent element's .string?


I am scraping text data from a table like the one below, and I would like to get this result:

Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

html = '''
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

I unwrap the <span> elements with beautifulsoup4:

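from bs4 import BeautifulSoup
import re

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')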

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

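However, I either get empty lines for all the empty <td> elements, or the 'dolor sit amet' line is not printed at all, even though I can see it when I pretty-print the HTML.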

# text with empty lines
for line in soup.find_all('td'):
    print(line.get_text().strip())
    print(line.string) # line with <span> prints None

# the line that contained the <span> is missing
for line in soup.find_all('td', text=re.compile(r'\w')):
    print(line.get_text().strip())

print(soup.prettify())

Am I doing something wrong? How can I use unwrap() and still access all of the text content without the empty lines?

Thanks for your help!

1 Answer

  • 0

    I tested it and you are close. Apply strip(), then use the re module to collapse each run of whitespace into a single space, for example:

    from bs4 import BeautifulSoup
    import re
    
    html = ''' 
    <table>
    <tr class="title last ">
      <td>
       Lorem ipsum
      </td>
      <td>
      </td>
     </tr>
     <tr>
      <td>
       <span class="caps">dolor
       </span>
       sit amet
      </td>
      <td>
      </td>
     </tr>
     <tr>
      <td>
       consectetur adipiscing elit,
      </td>
      <td>
      </td>
     </tr>
     <tr>
      <td>
       sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
      </td>
      <td>
      </td>
     </tr>
     <tr>
      <td>
        Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
      </td>
      <td>
      </td>
     </tr>
    </table>
    '''
    
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, 'html.parser')
    
    # remove <span> tag but keep content
    spans = soup.find_all('span')
    for tag in spans:
        tag.unwrap()
    
    print('\n'.join(
      re.sub(r'\s+', ' ', td.text.strip()) 
        for td in soup.find_all('td') if td.text.strip()))
    

    It produces:

    Lorem ipsum
    dolor sit amet
    consectetur adipiscing elit,
    sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
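
As for the question in the title: unwrap() does not break anything, but it leaves the <td> holding several adjacent text nodes. `.string` is only defined when a tag has exactly one child, so it returns None for that cell, and a `find_all('td', text=...)` filter, which compares against `.string`, skips it. The sketch below illustrates this on a cut-down copy of the table. It assumes bs4 4.8.0 or later for `smooth()`, and `cell_text` is a hypothetical helper name introduced only for this example.

```python
from bs4 import BeautifulSoup
import re

# Cut-down version of the table: one cell with a <span>, one empty cell.
html = """
<table>
 <tr><td>
   <span class="caps">dolor
   </span>
   sit amet
 </td><td>
 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
for span in soup.find_all("span"):
    span.unwrap()

td = soup.find("td")

# After unwrap() the cell contains several separate text nodes,
# so .string is None and a string filter on the tag finds nothing.
children_before = len(td.contents)
string_before = td.string
matched_before = len(soup.find_all("td", string=re.compile(r"\w")))

# smooth() merges adjacent text nodes in place (assumes bs4 >= 4.8.0).
td.smooth()
children_after = len(td.contents)
matched_after = len(soup.find_all("td", string=re.compile(r"\w")))


def cell_text(cell):
    """Hypothetical helper: join a cell's text pieces with single spaces."""
    return " ".join(" ".join(cell.stripped_strings).split())


# This part works with or without smooth(), and drops the empty cells.
lines = [text for text in map(cell_text, soup.find_all("td")) if text]
print("\n".join(lines))
```

If you only need the text, the `stripped_strings` approach at the end is probably enough on its own; `smooth()` matters only if you want `.string` or the string filter to work on the unwrapped cell.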
    
