How can I avoid page view limits when scraping Web data using Python?

Time: 2014-08-22 17:45:52

Tags: python http python-2.7 web-scraping

I am using Python to scrape US zip code population data from http://www.city-data.com, working through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are the individual zip code pages, which have URLs like this one: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access follow the same URL format, so my script simply does the following for each postal_code in range:

  1. Create the URL from the postal code
  2. Try to get a response from the URL
  3. If (2) succeeds, check the HTTP status code for that URL
  4. If the status is 200, retrieve the HTML and scrape the data into a list
  5. If the status is not 200, pass and count the error (not a valid postal code/URL)
  6. If there is no response from the URL because of an error, pass that postal code and count the error
  7. At the end of the script, print the counter variables and a timestamp

The problem is that the script runs fine for roughly 500 zip codes, then suddenly stops working and returns repeated timeout errors. I suspect the site's server is limiting page views from my IP address, which is keeping me from finishing the scraping I need to do (all 100,000 potential zip codes).
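One mitigation I have considered is simply throttling the requests so the server sees a slower rate. A minimal sketch of what I mean (the delay value is my own guess, not anything city-data.com documents):

    import time
    import requests

    DELAY_SECONDS = 2  #guessed pause between requests; not a documented limit

    def fetch_with_delay(url):
        """Hypothetical throttled fetch: sleep after each request so the
        server sees a slower request rate."""
        response = requests.get(url, timeout=5)
        time.sleep(DELAY_SECONDS)
        return response

But at 100,000 pages even a 2-second pause adds over 50 hours of waiting, so I would rather not rely on throttling alone.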

My question is this: is there a way to fool the site's server, for example by using some kind of proxy, so that it does not limit my page views and I can scrape all of the data I need?
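By "some kind of proxy" I mean something roughly like this; the proxy address below is a placeholder, not a real server:

    import requests

    #Hypothetical sketch: route the request through an HTTP proxy so the
    #target server sees the proxy's IP address instead of mine. The address
    #is a placeholder; a real proxy (or a rotating pool of them) would go here.
    proxies = {'http': 'http://123.45.67.89:8080'}
    url = 'http://www.city-data.com/zips/01001.html'
    response = requests.get(url, proxies=proxies, timeout=5)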

Thanks in advance for your help! Here is the code:

    ##POSTAL CODE POPULATION SCRAPER##

    import requests
    import re
    import datetime

    def zip_population_scrape():
    
        """
        This script will scrape population data for postal codes in range 
        from city-data.com.
        """
        postal_code_data = [['zip','population']] #list for storing scraped data
    
        #Counters for keeping track:
        total_scraped = 0 
        total_invalid = 0
        errors = 0
    
    
        for postal_code in range(1001,5000):
    
            #Zero-pad the postal code to five digits; an int like 1001 loses
            #its leading zero, which would produce an invalid URL
            postal_code_string = str(postal_code).zfill(5)
    
            #all postal code URLs have the same format on this site
            url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'
    
            #try to get current URL 
            try: 
                response = requests.get(url, timeout = 5)
                http = response.status_code
    
                #print current for logging purposes
                print url +" - HTTP:  " + str(http)
    
                #if valid webpage:
                if http == 200:
    
                    #save html as text
                    html = response.text
    
                    #extra print statement for status updates
                    print "HTML ready"
    
                    #try to find two substrings in HTML text
                    #add the substring in between them to list w/ postal code
                    try:            
    
                        found = re.search('population in 2011:</b> (.*)<br>', html).group(1)
    
                        #add to # scraped counter
                        total_scraped +=1
    
                        postal_code_data.append([postal_code_string,found])
    
                        #print statement for logging
                        print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                    #if the 2011 figure is not found, fall back to the 2010
                    #figure (note: this still raises AttributeError if
                    #neither substring is present on the page)
                    except AttributeError:
                        found = re.search('population in 2010:</b> (.*)<br>', html).group(1)
    
                        total_scraped +=1
    
                        postal_code_data.append([postal_code_string,found])
                        print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
    
                #if http == 404, zip is not valid. Add to counter and print log
                elif http == 404: 
                    total_invalid +=1
    
                    print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."
    
                #other http codes: add to error counter and print log
                else:
                    errors +=1
    
                    print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."
    
            #if the url request fails with a connection error, add to error count & pass
            except requests.exceptions.ConnectionError:
                errors +=1
                print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
                pass
    
            #if get url fails by timeout error, add to error count & pass
            except requests.exceptions.Timeout:
                errors +=1
                print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
                pass
    
    
        #print final log/counter data, along with a finished timestamp
        now = datetime.datetime.now()
        print now.strftime("%Y-%m-%d %H:%M")
        print str(total_scraped) + " total zips scraped."
        print str(total_invalid) + " total invalid zips."
        print str(errors) + " total errors."
    

0 Answers:

No answers yet