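I'm a student journalist and a Python newbie. I've been trying to figure out how to use a for loop to scrape every crime record on all current pages of my university's daily crime log. However, it only scrapes the first page. I've been looking at other people's code and questions, and I can't figure out what I'm missing. Any help is appreciated.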
import urllib.request
import requests
import csv
import bs4
import numpy as np
import pandas as pd
from pandas import DataFrame
for num in range(27): #Number of pagers plus
url = ("http://police.psu.edu/daily-crime-log?field_reported_value[value]&page=0".format(num))
r = requests.get(url)
source = urllib.request.urlopen(url).read()
bs_tree = bs4.BeautifulSoup(source, "lxml")
incident_nums = bs_tree.findAll("div", class_="views-field views-field-title")
occurred = bs_tree.findAll("div", class_="views-field views-field-field-occurred")
reported = bs_tree.findAll("div", class_="views-field views-field-field-reported")
incidents = bs_tree.findAll("div", class_="views-field views-field-field-nature-of-incident")
offenses = bs_tree.findAll("div", class_="views-field views-field-field-offenses")
locations = bs_tree.findAll("div", class_="views-field views-field-field-location")
dispositions = bs_tree.findAll("div", class_="views-field views-field-field-case-disposition")
allCrimes = pd.DataFrame(columns = ['Incident#', 'Occurred', 'reported', 'nature of incident', 'offenses', 'location', 'disposition'])
total = len(incident_nums)
count = 0
while (count<total):
incNum = incident_nums[count].find("span", class_="field-content").get_text()
occr = occurred[count].find("span", class_="field-content").get_text()
repo = reported[count].find("span", class_="field-content").get_text()
incNat = incidents[count].find("span", class_="field-content").get_text()
offe = offenses[count].find("span", class_="field-content").get_text()
loca = locations[count].find("span", class_="field-content").get_text()
disp = dispositions[count].find("span", class_="field-content").get_text()
allCrimes.loc[count] =[incNum, occr, repo, incNat, offe, loca, disp]
count +=1
Answer 0 (score: 1)
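Following other people's examples isn't necessarily bad practice, but you need to check that each piece works as you add it, at least until you gain confidence.
For instance, if you try running this for loop on its own...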
>>> for num in ('29'):
...     num
...
'2'
'9'
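you see that Python substitutes '2' for num and then '9', because iterating over the string '29' yields its characters one at a time. Not what you want.
If I follow your lead and check the site, I see that pages 0 through 26 exist. I can code for num in range(27), understanding that range starts at zero and that the loop stops one short of the value I give. In the statement where you request the URL, you need to convert this integer value to a string (format); note that .format(num) only inserts the number if the string contains a {} placeholder.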
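As a minimal sketch of that last point, the snippet below builds one URL per page by putting a {} placeholder where the page number belongs. The names BASE_URL and page_urls are made up for illustration, and the page count of 27 is taken from the question rather than verified against the site. It makes no network requests.

```python
# Hypothetical names; the URL pattern is the one from the question,
# with the hard-coded "page=0" replaced by a "{}" placeholder.
BASE_URL = "http://police.psu.edu/daily-crime-log?field_reported_value[value]&page={}"


def page_urls(page_count):
    """Return the URL for every page number from 0 to page_count - 1."""
    return [BASE_URL.format(num) for num in range(page_count)]


urls = page_urls(27)
print(urls[0])   # URL of the first page, ending in page=0
print(urls[-1])  # URL of the last page, ending in page=26
```

Printing the first and last URL before fetching anything is a cheap way to confirm that the page number really changes from one iteration to the next.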
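You're going through the loop many times without keeping anything! If you want other statements to be executed within the loop, you need to indent them (or perhaps the indentation was lost when you submitted the code).
After this, it isn't clear to me what you are doing.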
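One way to keep results across iterations, sketched under assumptions: create the container once before the loop, and indent the fetching and storing statements so they run once per page. Here collect_all_pages and fake_fetch_rows are hypothetical names, and fake_fetch_rows is only a stand-in for the real download-and-parse step (requests plus BeautifulSoup in the question), so the pattern can be run without touching the network.

```python
def collect_all_pages(page_count, fetch_rows):
    """Call fetch_rows(num) for each page and keep every row it returns."""
    all_rows = []  # created once, BEFORE the loop, so it is never reset
    for num in range(page_count):
        rows = fetch_rows(num)  # indented: runs once per page
        all_rows.extend(rows)   # indented: this page's rows are kept
    return all_rows


def fake_fetch_rows(num):
    """Stand-in for the real scraping step; returns two made-up rows per page."""
    return [("INC-{}-{}".format(num, i), "page {}".format(num)) for i in range(2)]


crimes = collect_all_pages(3, fake_fetch_rows)
print(len(crimes))  # 6 rows: 2 rows from each of 3 pages
```

In the question's code the same idea would presumably mean creating allCrimes before the for loop rather than inside it, and not reusing row labels that restart at 0 on every page; that is a reading of the posted code, not something confirmed by running it.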