Extracting the text between link tags with BeautifulSoup in Python

I have HTML code like this:

<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>

I am trying to extract the text that is displayed when this HTML is rendered.

More specifically, for this example 'a' tag, I am trying to extract "EZSTORAGE - PACK IT. STORE IT. WIN - Nationwide - Restrictions - Ends 6/30/15".

But I cannot extract the full text, because it is broken up by the 'img' and 'span' tags.

To give more context, I have been using the code below to find all the 'a' tags and extract the link text.

for link in soup.find_all('a', id='mylink'):
    raw.append(link)
    link_text = link.contents[0].encode('utf-8')
    sweeps.append(link_text)

#output: 'EZSTORAGE - PACK IT. STORE IT. WIN - '

Any insight would be greatly appreciated!

2 Answers

  • 0

    Can't you do it like this MWE, using link.text instead of link.contents?

    from bs4 import BeautifulSoup

    text = """
    <a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>
    """

    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(text, 'html.parser')

    for link in soup.find_all('a', id='mylink'):
        # .text joins every string inside the tag, including those in nested tags
        link_text = link.text
        print(link_text)
    

    Result:

    EZSTORAGE - PACK IT. STORE IT. WIN -  Nationwide - Restrictions - Ends 6/30/15
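
The result above keeps a double space where the 'img' tag was removed. A minimal sketch of one way to tidy that up, assuming the same markup (attributes shortened here) and that collapsing runs of whitespace is acceptable for your data; the variable names `raw_text` and `clean_text` are illustrative only:

```python
from bs4 import BeautifulSoup

html = (
    '<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
    'EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png"/>'
    ' Nationwide - <span title="college students/staff of schools in valid states">'
    'Restrictions</span> - Ends 6/30/15</a>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", id="mylink")

# get_text() concatenates every text node under the tag and drops the tags
raw_text = link.get_text()

# split() with no arguments splits on any whitespace run, so joining
# with a single space collapses the gap left by the <img> tag
clean_text = " ".join(raw_text.split())
print(clean_text)
```

`link.text` and `link.get_text()` return the same string here; the whitespace cleanup is a separate step done in plain Python.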
    
  • 0

    You can use a regular expression to find all the text:

    import re

    content = r'<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>'

    # Capture whatever sits between a closing '>' and the next '<'
    links = re.findall(r'>(.*?)<', content)
    # Join the captured fragments into one string
    a = "".join(links)
    print(a)
    

    This returns "EZSTORAGE - PACK IT. STORE IT. WIN -  Nationwide - Restrictions - Ends 6/30/15"
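
Note that the pattern `>(.*?)<` matches text between any two tags in the input, not only inside the target link, so on a full page it would also pick up text from unrelated elements. If installing bs4 is not an option, one possible alternative is the standard library's `html.parser`. The sketch below is only an illustration under that assumption; the class name `LinkTextExtractor` and its attributes are hypothetical, not part of any library:

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside <a> tags with a given id."""

    def __init__(self, link_id):
        super().__init__()
        self.link_id = link_id
        self.inside = False   # True while we are between <a id=...> and </a>
        self.parts = []       # text fragments of the link currently being read
        self.texts = []       # finished link texts, one per matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("id") == self.link_id:
            self.inside = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "a" and self.inside:
            self.inside = False
            # Join the fragments and collapse runs of whitespace
            self.texts.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        # Text inside nested tags such as <span> also arrives here
        if self.inside:
            self.parts.append(data)


content = (
    '<p>unrelated text</p>'
    '<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
    'EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png"/>'
    ' Nationwide - <span title="college students/staff of schools in valid states">'
    'Restrictions</span> - Ends 6/30/15</a>'
)

parser = LinkTextExtractor("mylink")
parser.feed(content)
parser.close()
link_texts = parser.texts
print(link_texts)
```

The extra `<p>` element in `content` is there only to show that text outside the matching link is ignored.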
