
Parsing only the content before two consecutive <br> tags with BeautifulSoup (Python)


I want to use BeautifulSoup to extract all the http://example.com/1 links and ignore every link that comes after the two consecutive <br><br> tags.

<div class="compost">
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/3"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>

This is the part I need to parse:

<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>

Here is part of my code:

for links in obja.find_all("div", class_="compost"):
        if links.has_attr('href'):
            print links['href']
        #
        aa = links.findAll('a')[0]
        print aa.attrs['href']
        txt = []
        for i in links.findAll('br'):
            txt.append(i.text)
            print i.nextSibling
            if i.nextSibling.text != u'br':
                txt.append(i.nextSibling.text)

        ''.join(txt)

My script extracts all the links. How can I extract only the http://example.com/1 links and ignore every link after the <br><br>?

1 Answer

  • 1

    You can find the first <br><br> and search for hrefs only in the substring that precedes it.

    Like this:

    from bs4 import BeautifulSoup
    
    example = """
    <div class="compost">
    <br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
    <br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
    <br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
    <br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
    <br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
    <br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
    <br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
    <br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
    <br>
    <br>
    <b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b>
    <br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b>
    <br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b>
    <br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b>
    <br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b>
    <br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b>
    <br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b>
    <br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b>
    <br>
    <br>
    ...."""
    
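    # Keep only the markup before the first pair of consecutive <br> tags.
    # str.index raises ValueError if the exact text "<br>\n<br>" is not found.
    br_split = example[:example.index("<br>\n<br>")]
    
    soup = BeautifulSoup(br_split, "html.parser")
    
    # Every <a> left in the truncated markup belongs to the first group of links.
    print(soup.find_all("a"))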
    

    Output: a list of the eight <a> tags from the first group, each pointing to http://example.com/1.
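
The literal search for "<br>\n<br>" in the answer only matches when the two tags are separated by exactly one newline. As a minimal sketch of the same substring idea, assuming the page might indent the tags or write them as <br/>, a regular expression can locate the cut point instead. The names `DOUBLE_BR` and `cut_before_double_br` and the shortened `sample` markup are hypothetical, introduced only for this illustration.

```python
import re

# Two <br> tags separated only by whitespace; also accepts <br/> and <BR>.
DOUBLE_BR = re.compile(r"<br\s*/?>\s*<br\s*/?>", re.IGNORECASE)


def cut_before_double_br(html):
    """Return the part of html before the first pair of consecutive <br> tags.

    If no such pair exists, the whole string is returned unchanged.
    """
    match = DOUBLE_BR.search(html)
    return html if match is None else html[:match.start()]


# Hypothetical, shortened sample with indentation and a self-closing <br/>.
sample = (
    '<div class="compost">\n'
    '<br><b><a href="http://example.com/1">text 2</a></b>\n'
    '<br><b><a href="http://example.com/1">text 3</a></b>\n'
    '  <br>\n'
    '  <br/>\n'
    '<b><a href="http://example.com/2">text 2</a></b>\n'
    '</div>'
)

first_group = cut_before_double_br(sample)
print(first_group)
```

The returned string can then be passed to BeautifulSoup(first_group, "html.parser") and searched with find_all("a"), exactly as in the answer's snippet.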

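Another option, not part of the original answer, is to skip the string slicing and walk the parsed tree instead: iterate over the children of the div and stop at the second of two consecutive <br> tags. This is only a sketch under the assumption that the groups are direct children of div.compost, as in the question's markup. The function name `links_before_double_br` and the shortened `html` sample are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def links_before_double_br(container):
    """Collect href values of <a> tags that precede the first <br><br> pair."""
    hrefs = []
    previous_was_br = False
    for node in container.children:
        if isinstance(node, NavigableString):
            # Whitespace-only text between tags does not break a <br><br> pair.
            if node.strip():
                previous_was_br = False
            continue
        if node.name == "br":
            if previous_was_br:
                break  # second consecutive <br>: the first group ends here
            previous_was_br = True
            continue
        previous_was_br = False
        # The <a> may be the node itself or nested inside it (here inside <b>).
        if node.name == "a" and node.has_attr("href"):
            hrefs.append(node["href"])
        hrefs.extend(a["href"] for a in node.find_all("a", href=True))
    return hrefs


# Hypothetical, shortened version of the markup from the question.
html = """
<div class="compost">
<br><b><a target="_blank" href="http://example.com/1">text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1">text 3</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/2">text 2</a></b>
<br><b><a target="_blank" href="http://example.com/2">text 3</a></b>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="compost")
print(links_before_double_br(div))
```

Because this works on the parsed tree, it does not depend on how the <br> tags are spelled or spaced in the raw HTML, at the cost of a slightly longer loop.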