
Python - Web scraping using HTML tags


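I am trying to scrape the job postings listed on the web page at this URL: https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad

See the image (Web inspect) for details of the page inspection.

Inspecting the page shows the following:

  • Each listed job is in an HTML li with class="jobs-list-item". The parent div inside the li carries the following HTML attributes and data:

data-ph-at-job-title-text="Software Engineer II", data-ph-at-job-category-text="Engineering", data-ph-at-job-post-date-text="2018-03-19T16:33:00".

  • The first child div of the parent div, with class="information", contains the URL href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II"

  • The third child div of the parent div, with class="description au-target", holds a short job description.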

I need to extract the following information for each job:

  • Job title

  • Job category

  • Job post date

  • Job post time

  • Job URL

  • Short job description
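
The structure and fields listed above could be extracted roughly as follows. This is only a sketch: `SAMPLE_HTML` is a hypothetical fragment built from the attributes described in the observations, the `job-meta` div and the `<a>` tag inside `information` are assumptions, and `parse_jobs` is an illustrative name. It assumes the rendered markup is already available (for example, saved from the browser), and the live page's markup may differ.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the inspected structure; not copied from the site.
SAMPLE_HTML = """
<ul>
  <li class="jobs-list-item">
    <div data-ph-at-job-title-text="Software Engineer II"
         data-ph-at-job-category-text="Engineering"
         data-ph-at-job-post-date-text="2018-03-19T16:33:00">
      <div class="information">
        <a href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II">Software Engineer II</a>
      </div>
      <div class="job-meta">Hyderabad</div>
      <div class="description au-target">Short description of the role.</div>
    </div>
  </li>
</ul>
"""


def parse_jobs(html):
    """Return one dict per li.jobs-list-item found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for item in soup.select("li.jobs-list-item"):
        # The parent div is the one carrying the data-ph-at-* attributes.
        parent = item.find("div", attrs={"data-ph-at-job-title-text": True})
        if parent is None:
            continue
        # Split the ISO-style timestamp into separate date and time strings.
        posted = datetime.strptime(
            parent["data-ph-at-job-post-date-text"], "%Y-%m-%dT%H:%M:%S"
        )
        link = parent.select_one("div.information [href]")
        desc = parent.select_one("div.description")
        jobs.append({
            "title": parent["data-ph-at-job-title-text"],
            "category": parent["data-ph-at-job-category-text"],
            "post_date": posted.date().isoformat(),
            "post_time": posted.time().isoformat(),
            "url": link["href"] if link else None,
            "description": desc.get_text(strip=True) if desc else None,
        })
    return jobs


for job in parse_jobs(SAMPLE_HTML):
    print(job)
```
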

I tried to scrape the page with the Python code below, but could not extract the required information.

import requests
from bs4 import BeautifulSoup


def ms_jobs():
    url = 'https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad'
    resp = requests.get(url)

    if resp.status_code == 200:
        print("Successfully opened the web page")
        soup = BeautifulSoup(resp.text, 'html.parser')
        print(soup)
    else:
        print("Error")


ms_jobs()

1 Answer

  • 1

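    If you want to do this with requests, you need to reverse engineer the site. Open the developer tools in Chrome, select the Network tab, and fill in the form.

    This shows how the site loads its data. If you dig into it, you will see that the site fetches the data by doing a POST to this endpoint: https://careers.microsoft.com/widgets. It also shows the payload the site uses. The site uses cookies, so all you have to do is create a session that keeps the cookie, get one, and copy/paste the payload.

    This way you will be able to extract the same JSON data that the JavaScript fetches to populate the site dynamically.

    Below is a working example of what that looks like. All that is left is to parse the JSON as you see fit.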

    import requests
    from pprint import pprint
    
    # create a session to grab a cookie from the site
    session = requests.Session()
    r = session.get("https://careers.microsoft.com/us/en/")
    
    # these params are the ones that the dev tools show that site sets when using the website form
    payload = {
        "lang":"en_us",
        "deviceType":"desktop",
        "country":"us",
        "ddoKey":"refineSearch",
        "sortBy":"",
        "subsearch":"",
        "from":0,
        "jobs":"true",
        "counts":"true",
        "all_fields":["country","state","city","category","employmentType","requisitionRoleType","educationLevel"],
        "pageName":"search-results",
        "size":20,
        "keywords":"",
        "global":"true",
        "selected_fields":{"city":["Hyderabad"],"country":["India"]},
        "sort":"null",
        "locationData":{}
    }
    
    # this is the endpoint the site uses to fetch json
    url = "https://careers.microsoft.com/widgets"
    r = session.post(url, json=payload)
    data = r.json()
    job_list = data['refineSearch']['data']['jobs']
    
    # the job_list will hold 20 jobs (you can set the size parameter in the payload to a higher number if you please - I tested 100, and that returned 100 jobs)
    job = job_list[0]
    pprint(job)
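
If more jobs are needed than a single request returns, one option might be to request the results page by page. The sketch below only builds the payloads and sends nothing over the network. It assumes that "from" is a zero-based offset into the results, which the example above does not confirm. `paged_payloads` and `BASE_PAYLOAD` are illustrative names, and `BASE_PAYLOAD` is an abbreviated stand-in for the full payload shown above.

```python
import copy

# Abbreviated stand-in for the full payload in the example above.
BASE_PAYLOAD = {
    "ddoKey": "refineSearch",
    "from": 0,
    "size": 20,
    "selected_fields": {"city": ["Hyderabad"], "country": ["India"]},
}


def paged_payloads(base_payload, total_jobs, page_size=20):
    """Yield one payload per page, assuming "from" is a zero-based offset."""
    for offset in range(0, total_jobs, page_size):
        # Deep copy so nested dicts in the base payload are never shared or mutated.
        payload = copy.deepcopy(base_payload)
        payload["from"] = offset
        payload["size"] = page_size
        yield payload


for payload in paged_payloads(BASE_PAYLOAD, total_jobs=45, page_size=20):
    print(payload["from"], payload["size"])
```

Each generated payload would then be passed to `session.post(url, json=payload)` as in the example above, and the job lists from each response concatenated.
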
    

    Cheers.
