
Looping through multiple pages to scrape with Python and BS4


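I'm a student journalist and new to Python. I've been trying to figure out how to use a for loop to scrape every individual entry on all current pages of my university's daily crime log. However, it only scrapes the first page. I've been looking at other people's code and questions, but I can't figure out what I'm missing. Any help is appreciated. Thanks.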

import urllib.request
import requests
import csv
import bs4
import numpy as np
import pandas as pd
from pandas import DataFrame

for num in range(27): # Number of pages plus one
    url = ("http://police.psu.edu/daily-crime-log?field_reported_value[value]&page=0".format(num))
    r = requests.get(url)

source = urllib.request.urlopen(url).read()
bs_tree = bs4.BeautifulSoup(source, "lxml")

incident_nums = bs_tree.findAll("div", class_="views-field views-field-title")
occurred = bs_tree.findAll("div", class_="views-field views-field-field-occurred")
reported = bs_tree.findAll("div", class_="views-field views-field-field-reported")
incidents = bs_tree.findAll("div", class_="views-field views-field-field-nature-of-incident")
offenses = bs_tree.findAll("div", class_="views-field views-field-field-offenses")
locations = bs_tree.findAll("div", class_="views-field views-field-field-location")
dispositions = bs_tree.findAll("div", class_="views-field views-field-field-case-disposition")

allCrimes = pd.DataFrame(columns = ['Incident#', 'Occurred', 'reported', 'nature of incident', 'offenses', 'location', 'disposition'])

total = len(incident_nums)
count = 0

while (count<total):
    incNum = incident_nums[count].find("span", class_="field-content").get_text()
    occr = occurred[count].find("span", class_="field-content").get_text()
    repo = reported[count].find("span", class_="field-content").get_text()
    incNat = incidents[count].find("span", class_="field-content").get_text()
    offe = offenses[count].find("span", class_="field-content").get_text()
    loca = locations[count].find("span", class_="field-content").get_text()
    disp = dispositions[count].find("span", class_="field-content").get_text()
    allCrimes.loc[count] =[incNum, occr, repo, incNat, offe, loca, disp]
    count +=1

1 Answer


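    Following other people's examples isn't necessarily bad practice, but you need to check that each piece works as you add it, at least until you gain confidence.

    For example, if you try running this for loop on its own...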

    >>> for num in ('29'):
    ...     num
    ...     
    '2'
    '9'
    

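    You can see that Python sets num to '2' and then to '9', because iterating over a string yields its characters one at a time. Not what you want.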

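    If I follow your lead and check the site, I see that pages 0 through 26 exist. I can code that as for num in range(27). Understand that range starts at zero and stops one short of the value I give it. In the statement where you request the URL, you need to convert this integer to a string and put it into the URL (format).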
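As a minimal sketch of that last point: the page number can be substituted into the URL through a `{}` placeholder. In the posted code the URL ends in a literal `page=0`, so `.format(num)` has nothing to fill in and every iteration builds the same address. The URL is the one from the question; `BASE_URL` and `build_page_urls` are names made up for this illustration.

```python
# URL from the question, with the literal page number replaced by a placeholder.
BASE_URL = "http://police.psu.edu/daily-crime-log?field_reported_value[value]&page={}"


def build_page_urls(page_count):
    """Return one URL per page, numbered 0 through page_count - 1."""
    # str.format converts the integer to text and drops it into the {} slot.
    return [BASE_URL.format(num) for num in range(page_count)]


urls = build_page_urls(27)
print(urls[0])   # first page
print(urls[-1])  # last page
```

With 27 pages this should produce addresses ending in `page=0` through `page=26`.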

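    You go through the loop many times without keeping anything! If you want the other statements to run inside the loop, you need to indent them (or perhaps the indentation was lost when you submitted the code).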
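One way this could look, as a rough sketch rather than a tested fix for the live site: fetch and parse inside the loop body, and append each page's rows to a list that outlives the loop. The CSS class names and column labels come from the question. `FIELD_CLASSES`, `fetch_html`, `parse_page`, and `scrape_all` are hypothetical helper names, and the sketch assumes every matched `div` contains a `span` with class `field-content`. It uses `"html.parser"` instead of `"lxml"` only to avoid an extra dependency.

```python
BASE_URL = "http://police.psu.edu/daily-crime-log?field_reported_value[value]&page={}"

# Column label -> class attribute of the div that holds it (both from the question).
FIELD_CLASSES = {
    "Incident#": "views-field views-field-title",
    "Occurred": "views-field views-field-field-occurred",
    "reported": "views-field views-field-field-reported",
    "nature of incident": "views-field views-field-field-nature-of-incident",
    "offenses": "views-field views-field-field-offenses",
    "location": "views-field views-field-field-location",
    "disposition": "views-field views-field-field-case-disposition",
}


def fetch_html(url):
    """Download one page and return its HTML text."""
    import requests  # third-party; imported here so the loop logic runs without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_page(html):
    """Return one list of field values per incident found on a single page."""
    import bs4  # third-party; imported here for the same reason as above

    soup = bs4.BeautifulSoup(html, "html.parser")
    columns = []
    for css_class in FIELD_CLASSES.values():
        divs = soup.find_all("div", class_=css_class)
        columns.append(
            [d.find("span", class_="field-content").get_text(strip=True) for d in divs]
        )
    # zip(*columns) turns seven per-field lists into per-incident rows.
    return [list(row) for row in zip(*columns)]


def scrape_all(page_count, fetch=fetch_html, parse=parse_page):
    """Fetch and parse every page, keeping the rows from all of them."""
    rows = []
    for num in range(page_count):
        html = fetch(BASE_URL.format(num))  # indented: runs once per page
        rows.extend(parse(html))            # accumulate instead of overwriting
    return rows
```

Calling `scrape_all(27)` and passing the result to `pd.DataFrame(rows, columns=list(FIELD_CLASSES))` would then give a single table covering every page. Passing `fetch` and `parse` as parameters is a design choice that lets the loop be checked with stand-in functions before hitting the real site.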

    Beyond this point, it isn't clear to me what you're doing.
