Looping through multiple pages to scrape with Python and BS4

Time: 2017-03-16 17:13:55

Tags: python web-scraping beautifulsoup urllib

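I'm a student journalist and new to Python. I've been trying to figure out how to use a for loop to scrape every crime record from all of the current pages of my university's daily crime log. However, it only scrapes the first page. I've been studying other people's code and questions, and I can't figure out what I'm missing. Any help is appreciated.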



import urllib.request
import requests
import csv
import bs4
import numpy as np
import pandas as pd
from pandas import DataFrame

for num in range(27): #Number of pagers plus
    url = ("http://police.psu.edu/daily-crime-log?field_reported_value[value]&page=0".format(num))
    r = requests.get(url)

source = urllib.request.urlopen(url).read()
bs_tree = bs4.BeautifulSoup(source, "lxml")

incident_nums = bs_tree.findAll("div", class_="views-field views-field-title")
occurred = bs_tree.findAll("div", class_="views-field views-field-field-occurred")
reported = bs_tree.findAll("div", class_="views-field views-field-field-reported")
incidents = bs_tree.findAll("div", class_="views-field views-field-field-nature-of-incident")
offenses = bs_tree.findAll("div", class_="views-field views-field-field-offenses")
locations = bs_tree.findAll("div", class_="views-field views-field-field-location")
dispositions = bs_tree.findAll("div", class_="views-field views-field-field-case-disposition")

allCrimes = pd.DataFrame(columns = ['Incident#', 'Occurred', 'reported', 'nature of incident', 'offenses', 'location', 'disposition'])

total = len(incident_nums)
count = 0

while (count<total):
    incNum = incident_nums[count].find("span", class_="field-content").get_text()
    occr = occurred[count].find("span", class_="field-content").get_text()
    repo = reported[count].find("span", class_="field-content").get_text()
    incNat = incidents[count].find("span", class_="field-content").get_text()
    offe = offenses[count].find("span", class_="field-content").get_text()
    loca = locations[count].find("span", class_="field-content").get_text()
    disp = dispositions[count].find("span", class_="field-content").get_text()
    allCrimes.loc[count] =[incNum, occr, repo, incNat, offe, loca, disp]
    count +=1

1 Answer:

Answer 0 (score: 1)

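Following other people's examples isn't necessarily bad practice, but you need to check that each piece works as you add it, at least until you gain confidence.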

For instance, try running this for loop on its own:

>>> for num in ('29'):
...     num
...     
'2'
'9'

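You can see that Python assigns '2' to num and then '9'. The parentheses don't make a tuple here, so ('29') is just the string '29', and looping over a string yields its characters one at a time. That's not what you want.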

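If I follow your lead and check the site, I see that pages 0 through 26 exist. I could code for num in range(27). Understand that range starts at zero and the loop stops one short of the value I give it. In the statement where you build the URL, you need to convert this integer to a string (format). Note that format only inserts a value where the string contains a {} placeholder; a URL that ends in a literal page=0 comes back unchanged no matter what you pass in.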
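To make the page-number substitution concrete, here is a minimal sketch. It assumes the page number is the only part of the URL that changes; the names PAGE_URL and page_urls are hypothetical, and the URL is the one from the question with its trailing 0 swapped for a {} placeholder.

```python
# Hypothetical constant name; the URL is taken from the question,
# with the trailing page number replaced by a {} placeholder.
PAGE_URL = "http://police.psu.edu/daily-crime-log?field_reported_value[value]&page={}"


def page_urls(page_count):
    """Build one URL per page, numbered 0 through page_count - 1."""
    return [PAGE_URL.format(num) for num in range(page_count)]


urls = page_urls(27)
print(urls[0])   # first page
print(urls[-1])  # last page

# Without a placeholder, format() has nothing to fill in,
# so the string comes back exactly as it went in.
unchanged = "daily-crime-log?page=0".format(5)
print(unchanged)
```

Printing the first and last URL is a quick way to confirm the range covers the pages you expect before making any requests.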

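You go through the loop many times without keeping anything! Each pass overwrites url and r, so only the last page's values survive. If you want the other statements to run inside the loop, you need to indent them (or perhaps the indentation was lost when you posted the code).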
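One way to keep something from every pass might look like the sketch below. It is an outline under assumptions, not a tested scraper for that site: the class names are copied from the question, the names FIELDS, parse_rows and scrape_all are hypothetical, and the fetch function is passed in so the loop can be tried without touching the network.

```python
# Suffixes of the "views-field-..." class names used in the question.
FIELDS = ["title", "field-occurred", "field-reported", "field-nature-of-incident",
          "field-offenses", "field-location", "field-case-disposition"]
PAGE_URL = "http://police.psu.edu/daily-crime-log?field_reported_value[value]&page={}"


def parse_rows(html):
    """Return one list of field texts per incident on a single page (assumed layout)."""
    import bs4  # imported here so the loop below can be exercised without bs4 installed
    tree = bs4.BeautifulSoup(html, "html.parser")
    columns = [tree.find_all("div", class_="views-field-" + name) for name in FIELDS]
    return [[div.find("span", class_="field-content").get_text(strip=True) for div in row]
            for row in zip(*columns)]


def scrape_all(fetch, page_count, parse=parse_rows):
    """Fetch and parse every page, collecting the rows from all of them."""
    all_rows = []                           # created once, before the loop
    for num in range(page_count):
        html = fetch(PAGE_URL.format(num))  # indented: runs once per page
        all_rows.extend(parse(html))        # indented: keeps this page's rows
    return all_rows
```

With requests installed, a call along the lines of scrape_all(lambda url: requests.get(url).text, 27) would return a list of rows that can be handed to pd.DataFrame together with the column names from the question.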

Beyond this point, it's not clear to me what you're doing.