Escaped ASCII characters in a URL causing encoding DEBUG warnings in Scrapy

Asked: 2014-07-03 20:26:17

Tags: python unicode ascii scrapy web-crawler

Below I have a fairly straightforward single-rule crawler for Scrapy to execute. It visits the search results of the Leis Municipais database for the city of São Paulo over a two-year period (2012-2014). Note that the start URL contains percent-escaped ASCII characters.

Most of the time, for example when an ASCII space is embedded in a URL as %20, simply prefixing the quoted string with `u` to make it a unicode literal works (i.e., there is no problem while crawling or parsing).

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from lei_municipal.items import LeiMunicipalItem

class MySpider(CrawlSpider):
    name = "leis"
    allowed_domains = ["leismunicipais.com.br"]
    start_urls = [u"https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?city=S%E3o%20Paulo&state=SP&tp=ord&page_this=1&block=0&year1=2012&year2=2014&ementaouintegra=naementa&wordkey=&&camara=1"]

    rules = (Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="pages_ant_prox"]',)),
                  callback="parse_items", follow=True),)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        leis = hxs.select('//div[@id="law_text"]')
        items = []
        for lei in leis:  # one selector per law entry (avoid shadowing the list)
            item = LeiMunicipalItem()
            item["numero"] = lei.select("a/b/text()").extract()[0].encode("utf-8")
            item["descricao"] = lei.select("a/div/text()").extract()[0].encode("utf-8")
            item["url"] = lei.select("a/@href").extract()[0].encode("utf-8")
            items.append(item)
        return items
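As an aside, the percent-encoded query string in `start_urls` can be generated instead of hand-written. A minimal sketch using Python 3's `urllib.parse` (the spider above targets Python 2-era Scrapy, so this is illustrative only; note that `urlencode` escapes "São Paulo" as UTF-8, `S%C3%A3o+Paulo`, which is the form the server itself uses):

```python
from urllib.parse import urlencode

# Build the search URL from a parameter dict instead of hand-encoding escapes.
base = "https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl"
params = {
    "city": "São Paulo", "state": "SP", "tp": "ord", "page_this": 1,
    "block": 0, "year1": 2012, "year2": 2014,
    "ementaouintegra": "naementa", "wordkey": "", "camara": 1,
}
url = base + "?" + urlencode(params)  # "São Paulo" becomes S%C3%A3o+Paulo
print(url)
```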

When I execute the code above, I get the following DEBUG warnings:

  

2014-07-03 17:15:01-0300 [leis] DEBUG: Redirecting (meta refresh) to <GET https://www.leismunicipais.com.br> from <GET https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?block=10&camara=1&city=S%C3%A3o+Paulo&ementaouintegra=naementa&page_this=2&state=SP&tp=ord&wordkey=&year1=2012&year2=2014>
2014-07-03 17:15:01-0300 [leis] DEBUG: Redirecting (meta refresh) to <GET https://www.leismunicipais.com.br> from <GET https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?block=490&camara=1&city=S%C3%A3o+Paulo&ementaouintegra=naementa&page_this=50&state=SP&tp=ord&wordkey=&year1=2012&year2=2014>
2014-07-03 17:15:02-0300 [leis] DEBUG: Crawled (200) <GET https://www.leismunicipais.com.br> (referer: https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?city=S%E3o%20Paulo&state=SP&tp=ord&page_this=1&block=0&year1=2012&year2=2014&ementaouintegra=naementa&wordkey=&&camara=1)
2014-07-03 17:15:02-0300 [leis] INFO: Closing spider (finished)

The first suggestion was to test the XPaths from the scrapy shell:

scrapy shell "https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?city=S%E3o%20Paulo&state=SP&tp=ord&page_this=1&block=0&year1=2012&year2=2014&ementaouintegra=naementa&wordkey=&&camara=1"

Testing the first selector:

hxs.select('//a[@class="pages_ant_prox"]/text()')

yields:

  
    

[<HtmlXPathSelector xpath='//a[@class="pages_ant_prox"]' data=u'<a href="topsearch.pl?city=S%C3%A3o%20Pa'>,
 <HtmlXPathSelector xpath='//a[@class="pages_ant_prox"]' data=u'<a href="topsearch.pl?city=S%C3%A3o%20Pa'>,
 <HtmlXPathSelector xpath='//a[@class="pages_ant_prox"]' data=u'<a href="topsearch.pl?city=S%C3%A3o%20Pa'>,
 <HtmlXPathSelector xpath='//a[@class="pages_ant_prox"]' data=u'<a href="topsearch.pl?city=S%C3%A3o%20Pa'>]

Testing one of the second set of selectors:

hxs.select('//div[@id="law_text"]/a/b/text()')

yields:

  
    

[<HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16010/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16009/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16008/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16007/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16006/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16005/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16004/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16003/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16002/2014'>,
 <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16001/2014'>]

My question is: why is the start URL not being encoded correctly, so that Scrapy does not crawl and parse the desired information, and how can I correct this? From my own troubleshooting, the problem seems specific to the "%E3" character (the ã in São Paulo).
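For what it's worth, the two escapes seen in the logs are two different byte encodings of the same character: %E3 is the Latin-1 byte for ã, while %C3%A3 is its UTF-8 encoding (the form the server's redirect rewrites the URL to). A quick illustration with Python 3's `urllib.parse`:

```python
from urllib.parse import quote

# The same city name percent-encoded under two charsets:
latin1 = quote("São Paulo", encoding="latin-1")  # ã -> single byte 0xE3
utf8 = quote("São Paulo", encoding="utf-8")      # ã -> bytes 0xC3 0xA3
print(latin1)  # S%E3o%20Paulo
print(utf8)    # S%C3%A3o%20Paulo
```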

After testing the selectors in the shell, I am even more confused as to why the current script does not parse/return the desired results.

1 Answer:

Answer 0 (score: 0)

This is not a URL problem. The site does a meta refresh, which means it probably has all sorts of anti-scraping mechanisms in place.
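If following the meta refresh is itself the obstacle, Scrapy's MetaRefreshMiddleware can be switched off so the spider stays on the original response instead of being redirected (a sketch; this alone will not defeat any other anti-scraping measures the site may use):

```python
# settings.py (sketch): keep Scrapy from following <meta http-equiv="refresh">
METAREFRESH_ENABLED = False
```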