How can I avoid page view limits when scraping Web data using Python?

Time: 2014-08-22 17:45:52

Tags: python http python-2.7 web-scraping

I am using Python to scrape US zip code population data from http://www.city-data.com, working through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are the individual zip code pages, which have URLs like this one: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access follow the same URL format, so my script simply does the following for each postal_code in range:

  1. Create the URL from the postal code
  2. Try to get a response from the URL
  3. If (2) succeeds, check the HTTP status code for that URL
  4. If the status is 200, retrieve the HTML and scrape the data into a list
  5. If the status is not 200, pass and count the error (not a valid postal code/URL)
  6. If there is no response from the URL because of an error, pass that postal code and count the error
  7. At the end of the script, print the counter variables and a timestamp

The problem is that the script runs fine for roughly 500 zip codes, then suddenly stops working and returns repeated timeout errors. I suspect the site's server is limiting page views from my IP address, which is keeping me from finishing the scraping I need to do (all 100,000 potential zip codes).
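One mitigation I have considered is simply throttling the requests so the server sees a slower rate. A minimal sketch of what I mean (the delay value is my own guess, not anything city-data.com documents):

    import time
    import requests

    DELAY_SECONDS = 2  #guessed pause between requests; not a documented limit

    def fetch_with_delay(url):
        """Hypothetical throttled fetch: sleep after each request so the
        server sees a slower request rate."""
        response = requests.get(url, timeout=5)
        time.sleep(DELAY_SECONDS)
        return response

But at 100,000 pages even a 2-second pause adds over 50 hours of waiting, so I would rather not rely on throttling alone.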

My question is this: is there a way to fool the site's server, for example by using some kind of proxy, so that it does not limit my page views and I can scrape all of the data I need?
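By "some kind of proxy" I mean something roughly like this; the proxy address below is a placeholder, not a real server:

    import requests

    #Hypothetical sketch: route the request through an HTTP proxy so the
    #target server sees the proxy's IP address instead of mine. The address
    #is a placeholder; a real proxy (or a rotating pool of them) would go here.
    proxies = {'http': 'http://123.45.67.89:8080'}
    url = 'http://www.city-data.com/zips/01001.html'
    response = requests.get(url, proxies=proxies, timeout=5)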

Thanks in advance for your help! Here is the code:

    ##POSTAL CODE POPULATION SCRAPER##

    import requests
    import re
    import datetime

    def zip_population_scrape():
    
        """
        This script will scrape population data for postal codes in range 
        from city-data.com.
        """
        postal_code_data = [['zip','population']] #list for storing scraped data
    
        #Counters for keeping track:
        total_scraped = 0 
        total_invalid = 0
        errors = 0
    
    
        for postal_code in range(1001,5000):
    
            #Zero-pad the postal code to five digits; an int like 1001 loses
            #its leading zero, which would produce an invalid URL
            postal_code_string = str(postal_code).zfill(5)
    
            #all postal code URLs have the same format on this site
            url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'
    
            #try to get current URL 
            try: 
                response = requests.get(url, timeout = 5)
                http = response.status_code
    
                #print current for logging purposes
                print url +" - HTTP:  " + str(http)
    
                #if valid webpage:
                if http == 200:
    
                    #save html as text
                    html = response.text
    
                    #extra print statement for status updates
                    print "HTML ready"
    
                    #try to find two substrings in HTML text
                    #add the substring in between them to list w/ postal code
                    try:            
    
                        found = re.search('population in 2011:</b> (.*)<br>', html).group(1)
    
                        #add to # scraped counter
                        total_scraped +=1
    
                        postal_code_data.append([postal_code_string,found])
    
                        #print statement for logging
                        print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                    #if the 2011 figure is not found, fall back to the 2010
                    #figure (note: this still raises AttributeError if
                    #neither substring is present on the page)
                    except AttributeError:
                        found = re.search('population in 2010:</b> (.*)<br>', html).group(1)
    
                        total_scraped +=1
    
                        postal_code_data.append([postal_code_string,found])
                        print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
    
                #if http == 404, zip is not valid. Add to counter and print log
                elif http == 404: 
                    total_invalid +=1
    
                    print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."
    
                #other http codes: add to error counter and print log
                else:
                    errors +=1
    
                    print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."
    
            #if the url request fails with a connection error, add to error count & pass
            except requests.exceptions.ConnectionError:
                errors +=1
                print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
                pass
    
            #if get url fails by timeout error, add to error count & pass
            except requests.exceptions.Timeout:
                errors +=1
                print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
                pass
    
    
        #print final log/counter data, along with a finished timestamp
        now = datetime.datetime.now()
        print now.strftime("%Y-%m-%d %H:%M")
        print str(total_scraped) + " total zips scraped."
        print str(total_invalid) + " total invalid zips."
        print str(errors) + " total errors."
    

0 Answers:

No answers yet