HTML confusion with Python BeautifulSoup

I followed thenewboston's tutorial on YouTube, and the code runs without any errors.

I am trying to print the "Generic Line List" and all the links that follow it. The list is at the bottom of this page: http://playrustwiki.com/wiki/List_of_Items

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages: #makes our pages change everytime
        url = 'http://playrustwiki.com/wiki/List_of_Items' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text) #find all the links in soup or all the titles
        for link in soup.findAll('a', {'class': 'a href'}): #links are a for anchors in HTML
            href = link.get('href') # href attribute
            print(href)
        page += 1

trade_spider(1)

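I have tried different HTML attributes, and I think this is where my confusion starts. Either I can't find the right attribute for my scraper to target, or I am calling the wrong one.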

Please help~

Thanks :)

1 Answer

The idea here is to find the element whose text is Generic Line List. Then locate the next ul sibling with find_next_sibling() and get all the links inside it with find_all():

h3 = soup.find('h3', text='Generic Line List')
generic_line_list = h3.find_next_sibling('ul')
for link in generic_line_list.find_all('a', href=True):
    print(link['href'])

Demo:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> 
>>> url = 'http://playrustwiki.com/wiki/List_of_Items'
>>> soup = BeautifulSoup(requests.get(url).content)
>>>
>>> h3 = soup.find('h3', text='Generic Line List')
>>> generic_line_list = h3.find_next_sibling('ul')
>>> for link in generic_line_list.find_all('a', href=True):
...     print(link['href'])
... 
/wiki/Wood_Barricade
/wiki/Wood_Shelter
...
/wiki/Uber_Hunting_Bow
/wiki/Cooked_Chicken_Breast
/wiki/Anti-Radiation_Pills
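
The same approach can be tried without a network connection. The following is only a minimal sketch: the HTML string is a made-up stand-in for the relevant part of the wiki page (the real markup may differ), and it uses the `string=` argument, which newer BeautifulSoup releases offer as the replacement for `text=`. It also shows why the filter from the question prints nothing: `{'class': 'a href'}` looks for anchors whose class attribute is "a href", and none exist.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for part of the wiki page; the real markup may differ.
html = """
<h3>Other List</h3>
<ul>
  <li><a href="/wiki/Other_Item">Other Item</a></li>
</ul>
<h3>Generic Line List</h3>
<ul>
  <li><a href="/wiki/Wood_Barricade">Wood Barricade</a></li>
  <li><a href="/wiki/Wood_Shelter">Wood Shelter</a></li>
  <li><a name="no-href">anchor without a link target</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# The filter from the question matches on the class attribute, not on the tag
# name or the href attribute, so it finds nothing in this snippet.
no_matches = soup.find_all("a", {"class": "a href"})

# Find the heading by its text, then step to the <ul> that follows it.
h3 = soup.find("h3", string="Generic Line List")
generic_line_list = h3.find_next_sibling("ul")

# href=True keeps only anchors that actually carry an href attribute.
hrefs = [link["href"] for link in generic_line_list.find_all("a", href=True)]

print(no_matches)
print(hrefs)
```

Because the search starts from the heading, links under other headings are left out, and anchors without an href are skipped by the `href=True` filter.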

Answered 2024-05-06T14:40:16+08:00
