How to extract the text inside a link, the text after the link, and any further text that follows, using Python

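I have parsed the following string with BeautifulSoup to extract data from it, but I can't get some of the data. I have tried several approaches. So far I have managed to get the text inside the "a" tags, the links, and the text that follows each link.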

<html>
 <body>
  <p align="left">
   <font face="Arial, Helvetica, sans-serif" size="2">
    <b>
     <font size="4">
      GOVERNOR:
     </font>
    </b>
    
</font> <font face="Arial, Helvetica, sans-serif" size="2"> <a href="http://governor.alabama.gov/"> <strong> Robert Bentley (R)* </strong> </a> - Ex-Morgan County Commissioner &amp; State Correctional Officer <strong>
<a href="http://www.facebook.com/stacy.george.3139"> Stacy George (R) </a> - Ex-Morgan County Commissioner &amp; State Correctional Officer
Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate
<a href="http://www.bassforbama.com/"> Kevin Bass (D) </a> - Businessman &amp; Ex-Pro Baseball Player
<a href="http://www.parkergriffithforcongress.com/"> Parker Griffith (D) </a> - Ex-Congressman, Ex-State Sen., Physician &amp; Ex-Republican </strong> </font> </p> </body> </html>

Here is my implementation using BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(Above_String)

"""for br in soup.find_all("br"):
    print(br)
    #print(br.nextSibling.content)
"""
for link in soup.find_all("a"):
    # link.string is None when the <a> tag wraps another tag such as <strong>
    if link.string is None:
        print(link.strong.string, link.get("href"), link.next_sibling)
    else:
        print(link.string, link.get("href"), link.next_sibling)

The code above prints the following:

> Robert 
                Bentley (R)*
      http://governor.alabama.gov/ 

>      Stacy George 
                (R)
      http://www.facebook.com/stacy.george.3139 
     - Ex-Morgan County Commissioner & State Correctional Officer

>      Kevin Bass (D)
      http://www.bassforbama.com/ 
     - Businessman & Ex-Pro Baseball Player


>      Parker Griffith 
                (D)
      http://www.parkergriffithforcongress.com/ 
     - Ex-Congressman, Ex-State Sen., Physician & Ex-Republican

The third item is missing:

Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate

How can I solve this with BeautifulSoup? I tried using find_all("br"), but it does not work: the br tags return NoneType.

1 Answer

  • 2

    Grab all the text nodes that follow each link:

    from itertools import takewhile
    from bs4 import NavigableString
    
    # Stop at the next <a> or <strong>; default to None for nodes without a tag name
    not_link = lambda t: getattr(t, 'name', None) not in ('a', 'strong')
    
    for link in soup.find_all("a"):
        print('Link contents:')
        text = link.text.strip()
        # Walk the siblings after the link until the next link starts
        for sibling in takewhile(not_link, link.next_siblings):
            if isinstance(sibling, NavigableString):
                text += str(sibling).strip()
            else:
                text += sibling.text.strip()
        print(text)
    

    This prints:

    Link contents:
    Robert 
                    Bentley (R)*- Ex-Morgan County Commissioner & State Correctional Officer
    Link contents:
    Stacy George 
                    (R)- Ex-Morgan County Commissioner & State Correctional OfficerBob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
    Link contents:
    Kevin Bass (D)- Businessman & Ex-Pro Baseball Player
    Link contents:
    Parker Griffith 
                    (D)- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
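
In the output above, the Bob Starkey entry has no link of its own, so it ends up appended to the Stacy George entry. The following is a minimal sketch of one way to split it out, assuming entries are separated by newlines or `<br>` tags as in the markup quoted in the question. `SAMPLE_HTML` is a shortened, hypothetical sample, and `is_not_link` and `candidates` are illustrative names rather than part of BeautifulSoup. It targets Python 3 and the built-in `html.parser` backend.

```python
from itertools import takewhile
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened sample modeled on the markup in the question.
SAMPLE_HTML = (
    '<p>'
    '<a href="http://www.facebook.com/stacy.george.3139"> Stacy George (R) </a>'
    ' - Ex-Morgan County Commissioner &amp; State Correctional Officer\n'
    'Bob Starkey (R) - Retired Businessman\n'
    '<a href="http://www.bassforbama.com/"> Kevin Bass (D) </a>'
    ' - Businessman &amp; Ex-Pro Baseball Player'
    '</p>'
)


def is_not_link(node):
    # Text nodes may not expose a tag name, so fall back to None.
    return getattr(node, "name", None) not in ("a", "strong")


def candidates(soup):
    """Return (href, text) pairs, one per line; href is None for unlinked lines."""
    records = []
    for link in soup.find_all("a"):
        pieces = [link.get_text(strip=True)]
        for sibling in takewhile(is_not_link, link.next_siblings):
            if isinstance(sibling, NavigableString):
                pieces.append(str(sibling))
            elif sibling.name == "br":
                # Assumption: a <br> separates entries just like a newline.
                pieces.append("\n")
            else:
                pieces.append(sibling.get_text())
        # One entry per non-empty line, with runs of whitespace collapsed.
        lines = [" ".join(line.split()) for line in " ".join(pieces).splitlines()]
        lines = [line for line in lines if line]
        for index, line in enumerate(lines):
            # Only the first line belongs to the link itself.
            records.append((link.get("href") if index == 0 else None, line))
    return records


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for href, text in candidates(soup):
    print(href, text)
```

Each unlinked line comes back with `None` in place of a URL, so it can be told apart from the linked entries. If the real page separates entries in some other way, the line-splitting step would need to be adjusted.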
    
