I'm using Python to pull U.S. zip-code population data from http://www.city-data.com, via this directory: http://www.city-data.com/zipDir.html. The specific pages I'm trying to scrape are the individual zip-code pages, which have URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip-code pages I need to access have this same URL format, so my script simply does the following for each postal_code in the range:
The problem is that when I run the script it works fine for roughly 500 zip codes, then abruptly stops working and returns repeated timeout errors. I suspect the site's server is rate-limiting page views from my IP address, which is keeping me from finishing the scrape I need to do (all 100,000 potential zip codes).
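The only workaround I've come up with so far is throttling the loop myself, something like the sketch below (the one-second delay is an arbitrary guess on my part; I don't know what pace the server actually tolerates):

import time
import requests

DELAY_SECONDS = 1  # arbitrary guess; the server's actual rate limit is unknown

for postal_code in range(1001, 100000):
    url = 'http://www.city-data.com/zips/' + str(postal_code).zfill(5) + '.html'
    response = requests.get(url, timeout=5)
    time.sleep(DELAY_SECONDS)  # pause between requests to stay under the limit

But at 100,000 pages even a one-second pause adds more than a day of waiting, so I'd rather find a real solution.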
My question is this: is there a way to fool the site's server, for example by using a proxy of some kind, so that it won't limit my page views and I can scrape all the data I need?
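If a proxy is the right approach, my understanding is that it would plug into requests via the proxies parameter, roughly like this (the address below is the placeholder from the requests documentation, not a working endpoint):

import requests

# placeholder proxy endpoint; a real one would go here
proxies = {'http': 'http://10.10.1.10:3128'}

response = requests.get('http://www.city-data.com/zips/01001.html',
                        proxies=proxies, timeout=5)
print response.status_code

Is that the right mechanism, and if so, where would I get proxies that the server won't also block?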
Thanks in advance for any help! Here's the code:
## POSTAL CODE POPULATION SCRAPER ##
import requests
import re
import datetime

def zip_population_scrape():
    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip', 'population']]  # list for storing scraped data

    # Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0

    # (test range for now; the full job would cover codes up to 99999)
    for postal_code in range(1001, 5000):
        # range() drops leading zeros, so pad the code back out to the
        # five-character form the URLs use
        postal_code_string = str(postal_code).zfill(5)

        # all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        # try to get the current URL
        try:
            response = requests.get(url, timeout=5)
            http = response.status_code

            # print the current URL for logging purposes
            print url + " - HTTP: " + str(http)

            # if valid webpage:
            if http == 200:
                # save HTML as text
                html = response.text
                # extra print statement for status updates
                print "HTML ready"

                # try to find two substrings in the HTML text and add the
                # substring between them to the list with the postal code
                try:
                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)
                    # add to scraped counter
                    total_scraped += 1
                    postal_code_data.append([postal_code_string, found])
                    # print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                # if the substrings are not found, fall back to the 2010
                # figures and do the same as above
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)
                    total_scraped += 1
                    postal_code_data.append([postal_code_string, found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            # if HTTP is 404, the zip is not valid; add to counter and print log
            elif http == 404:
                total_invalid += 1
                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            # other HTTP codes: add to error counter and print log
            else:
                errors += 1
                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        # if the GET fails with a connection error, add to error count and continue
        except requests.exceptions.ConnectionError:
            errors += 1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."

        # if the GET fails with a timeout error, add to error count and continue
        except requests.exceptions.Timeout:
            errors += 1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."

    # print final log/counter data, along with a finished timestamp
    now = datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."
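To be clear about what I'm imagining: inside the loop I would rotate through a pool of proxies, something like this sketch (the addresses are made-up placeholders, and get_with_rotation is just a name I invented for illustration; I don't know whether this would actually defeat the rate limiting):

import itertools
import requests

# made-up placeholder addresses; a real proxy pool would go here
proxy_pool = itertools.cycle([
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
])

def get_with_rotation(url):
    # each call sends the request through the next proxy in the pool
    proxy = next(proxy_pool)
    return requests.get(url, proxies={'http': proxy}, timeout=5)

response = get_with_rotation('http://www.city-data.com/zips/01001.html')
print response.status_code

If rotation like this is viable, any pointers on where to get usable proxies (or a better approach entirely) would be much appreciated.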