
Scraping the "next page" in Python


I want to scrape the following pages of a website. There are 20 pages in total, and I want to reach each next page starting from the first page's URL.

Code:

import requests
from bs4 import BeautifulSoup

b = []
url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
b.append(url)
while True:
    try:
        dct = {"data-icon": "k"}
        url = soup.find("", dct)
        url = url["href"]
        print(url)
    except TypeError:
        break
    if url:
        url = "https://abcde.com" + url
        print(url)
        b.append(url)
print(b)

HTML of the "next page" link:

<li class="next"><a href="https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/?p=2" data-icon="k">next page</a></li>

HTML on the last page:

<li class="next disabled"><a href="" data-icon="k">next page</a></li>

It only prints the first page's URL.

1 Answer


    What did you expect to happen? You only call requests.get(url) once, and that happens before you enter the while True loop. You need to move res=requests.get(url) and all of the lines that follow it inside the while loop so that your code actually fetches the subsequent pages. For example:

    # The following are used for debugging output in this example:
    import sys
    import traceback
    
    import requests
    from bs4 import BeautifulSoup
    
    # ... Your other code...
    
    b = []
    url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
    b.append(url)
    while True:
        try:
            res = requests.get(url)
        except Exception:
            print("Failed while fetching " + str(url))
            print("Stack trace:")
            traceback.print_exc()
            break
        # end try
        try:
            soup = BeautifulSoup(res.text, "lxml")
        except Exception:
            print("Failed setting up the BeautifulSoup parser object.")
            print("Response from request for '" + str(url) + "' was: \n\t" + str(res).replace("\n", "\n\t"), file=sys.stderr)  # Avoids polluting STDOUT
            traceback.print_exc()
            break
        # end try
    
        # The following line is not needed here because the new URL is appended in the IF statement at the bottom of the loop:
        # b.append(url)
    
        try:
            dct = {"data-icon": "k"}
            url = soup.find("", dct)
            url = url["href"]
            print(url)
        except TypeError:
            print("Leaving loop after parsing the URL from the page failed.")
            break
        if url:
            url = "https://abcde.com" + url
            print(url)
            b.append(url)
    # end while True
    
    # Debug statement:
    print("Outside of loop.")
    
    # Print output
    print(b)
    

    This asks each page for the next URL, because requests.get(url) is inside the loop and therefore runs on every iteration.
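
    As a side note, the loop can also be written more compactly. Since the href shown in the question is already an absolute URL, prefixing it with "https://abcde.com" would produce a malformed address; urllib.parse.urljoin resolves both relative and absolute hrefs correctly, and the empty href on the last page can end the crawl directly. A minimal sketch of that variant, assuming the same page structure as in the question:

    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    url = "https://abcde.com/cate6-%E7%BE%8E%E5%A6%9D%E4%BF%9D%E9%A4%8A/"
    pages = [url]

    while True:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "lxml")
        # The pagination anchor is identified by its data-icon="k" attribute.
        link = soup.find("a", {"data-icon": "k"})
        # On the last page the anchor is still present but its href is empty,
        # so both "no anchor found" and "empty href" end the crawl.
        if link is None or not link.get("href"):
            break
        # urljoin resolves the href against the current page, whether the
        # site emits relative or absolute links.
        url = urljoin(url, link["href"])
        pages.append(url)

    print(pages)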
