
Scraping the "next page" in Python


I want to scrape the following pages of a website. There are 20 pages in total, and I want to reach each next page starting from the first page's URL.

Code:

import requests
from bs4 import BeautifulSoup

b = []
url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
b.append(url)
while True:
    try:
        dct = {"data-icon": "k"}
        url = soup.find("", dct)
        url = url["href"]
        print(url)
    except TypeError:
        break
    if url:
        url = "https://abcde.com" + url
        print(url)
        b.append(url)
print(b)

HTML of the "next page" link:

<li class="next"><a href="https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/?p=2" data-icon="k">next page</a></li>

HTML on the last page:

<li class="next disabled"><a href="" data-icon="k">next page</a></li>

It only prints the first page's URL.

1 Answer


    What did you expect to happen? You only call requests.get(url) once, and that happens before you enter the while True loop. You need to move res=requests.get(url) and all of the lines that follow it inside the while loop so that your code actually fetches the subsequent pages. For example:

    # The following are used for debugging output in this example:
    import sys
    import traceback
    
    import requests
    from bs4 import BeautifulSoup
    
    # ... Your other code...
    
    b = []
    url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
    b.append(url)
    while True:
        try:
            res = requests.get(url)
        except Exception:
            print("Failed while fetching " + str(url))
            print("Stack trace:")
            traceback.print_exc()
            break
        # end try
        try:
            soup = BeautifulSoup(res.text, "lxml")
        except Exception:
            print("Failed setting up the BeautifulSoup parser object.")
            print("Response from request for '" + str(url) + "' was: \n\t" + str(res).replace("\n", "\n\t"), file=sys.stderr)  # Avoids polluting STDOUT
            traceback.print_exc()
            break
        # end try
    
        # The following line is not needed here because the new URL is appended in the IF statement at the bottom of the loop:
        # b.append(url)
    
        try:
            dct = {"data-icon": "k"}
            url = soup.find("", dct)
            url = url["href"]
            print(url)
        except TypeError:
            print("Leaving loop after parsing the URL from the page failed.")
            break
        if url:
            url = "https://abcde.com" + url
            print(url)
            b.append(url)
    # end while True
    
    # Debug statement:
    print("Outside of loop.")
    
    # Print output
    print(b)
    

    This asks each page for the next URL, because requests.get(url) is inside the loop and therefore runs on every iteration.
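
    As a side note, the loop can also be written more compactly. Since the href shown in the question is already an absolute URL, prefixing it with "https://abcde.com" would produce a malformed address; urllib.parse.urljoin resolves both relative and absolute hrefs correctly, and the empty href on the last page can end the crawl directly. A minimal sketch of that variant, assuming the same page structure as in the question:

    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
    pages = [url]

    while True:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "lxml")
        # The pagination anchor is identified by its data-icon="k" attribute.
        link = soup.find("a", {"data-icon": "k"})
        # On the last page the anchor is still present but its href is empty,
        # so both "no anchor found" and "empty href" end the crawl.
        if link is None or not link.get("href"):
            break
        # urljoin resolves the href against the current page, whether the
        # site emits relative or absolute links.
        url = urljoin(url, link["href"])
        pages.append(url)

    print(pages)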
