
HTML confusion with Python BeautifulSoup


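I've been following newboston's tutorial on YouTube, and my code runs without any errors.

I'm trying to print the "Generic Line List" and all the links that follow it; the list is at the bottom of this page: http://playrustwiki.com/wiki/List_of_Items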

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages: #makes our pages change everytime
        url = 'http://playrustwiki.com/wiki/List_of_Items' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text) #find all the links in soup or all the titles
        for link in soup.findAll('a', {'class': 'a href'}): #links are a for anchors in HTML
            href = link.get('href') # href attribute
            print(href)
        page += 1

trade_spider(1)

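I tried different HTML attributes, and I think this is where my confusion starts. Either I can't find the right attribute to pass to my scraper, or I'm calling the wrong one.

Please help~

Thanks :)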

1 Answer

  • 0

    The idea here is to find the element with the text Generic Line List, then find the next ul sibling via find_next_sibling() and get all the links inside it via find_all():

    h3 = soup.find('h3', text='Generic Line List')
    generic_line_list = h3.find_next_sibling('ul')
    for link in generic_line_list.find_all('a', href=True):
        print(link['href'])
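
To try the same approach without hitting the network, here is a minimal sketch that runs against an inline HTML snippet. The markup in SAMPLE_HTML and the helper names links_after_heading and links_by_class are made up for illustration; the real wiki page may be structured differently. It uses string=, which is the newer name for the text= argument in recent BeautifulSoup releases, and it also shows why the filter from the question, {'class': 'a href'}, matches nothing: class filters on the class attribute of the tag, and no anchor has a class called "a href".

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the wiki page's markup.
SAMPLE_HTML = """
<div>
  <h3>Other List</h3>
  <ul><li><a href="/wiki/Other_Item">Other Item</a></li></ul>
  <h3>Generic Line List</h3>
  <ul>
    <li><a href="/wiki/Wood_Barricade">Wood Barricade</a></li>
    <li><a href="/wiki/Wood_Shelter">Wood Shelter</a></li>
    <li><a>No link here</a></li>
  </ul>
</div>
"""


def links_after_heading(html, heading_text):
    """Return the hrefs in the first <ul> sibling after the matching <h3>."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h3", string=heading_text)
    if heading is None:
        return []  # heading not present on this page
    ul = heading.find_next_sibling("ul")
    if ul is None:
        return []  # no list follows the heading
    # href=True skips anchors that have no href attribute at all
    return [a["href"] for a in ul.find_all("a", href=True)]


def links_by_class(html, class_name):
    """Mimic the question's filter: anchors whose class attribute matches."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a", {"class": class_name})]


if __name__ == "__main__":
    print(links_after_heading(SAMPLE_HTML, "Generic Line List"))
    print(links_by_class(SAMPLE_HTML, "a href"))
```

The None checks are there because find() and find_next_sibling() return None when nothing matches, which would otherwise surface as an AttributeError on the next call.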
    

    Demo:

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> 
    >>> url = 'http://playrustwiki.com/wiki/List_of_Items'
    >>> soup = BeautifulSoup(requests.get(url).content)
    >>>
    >>> h3 = soup.find('h3', text='Generic Line List')
    >>> generic_line_list = h3.find_next_sibling('ul')
    >>> for link in generic_line_list.find_all('a', href=True):
    ...     print(link['href'])
    ... 
    /wiki/Wood_Barricade
    /wiki/Wood_Shelter
    ...
    /wiki/Uber_Hunting_Bow
    /wiki/Cooked_Chicken_Breast
    /wiki/Anti-Radiation_Pills
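
The hrefs printed above are relative to the site root. If you want full URLs that you can pass straight back to requests.get(), one option is urljoin from the standard library. This is only a sketch: to_absolute is a hypothetical helper name, and the two hrefs are copied from the demo output.

```python
from urllib.parse import urljoin

# Page the links were scraped from (same URL as in the question).
BASE_URL = "http://playrustwiki.com/wiki/List_of_Items"


def to_absolute(base_url, hrefs):
    """Resolve each (possibly relative) href against the page URL."""
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == "__main__":
    relative = ["/wiki/Wood_Barricade", "/wiki/Wood_Shelter"]
    for url in to_absolute(BASE_URL, relative):
        print(url)
```

urljoin leaves hrefs that are already absolute unchanged, so the same helper can be applied to a mixed list.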
    
