Below I have a fairly straightforward single-rule crawler for Scrapy to execute. It visits the search results in the Leis Municipais database for a two-year period (2012-2014) for the city of São Paulo. Note that the start URL contains percent-encoded non-ASCII characters.

In most cases, for example when an ASCII space is included in a URL (%20), simply encoding the string as unicode with a "u" prefix on the quoted string works (i.e., there is no problem while crawling or parsing).
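To see the encoding question at play, here is a small standalone sketch (plain Python, not part of the spider) comparing the two percent-encodings of the city name: Latin-1 turns ã into the single byte %E3 (as in the start URL above), while UTF-8 turns it into %C3%A3 (as in the URLs the site redirects to):

```python
from urllib.parse import quote

city = "S\u00e3o Paulo"  # "São Paulo"

# Latin-1 encodes ã as the single byte 0xE3 -> %E3 (what the start URL uses)
print(quote(city, encoding="latin-1"))  # S%E3o%20Paulo

# UTF-8 encodes ã as two bytes 0xC3 0xA3 -> %C3%A3 (what the site emits)
print(quote(city, encoding="utf-8"))    # S%C3%A3o%20Paulo
```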
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from lei_municipal.items import LeiMunicipalItem

    class MySpider(CrawlSpider):
        name = "leis"
        allowed_domains = ["leismunicipais.com.br"]
        start_urls = [u"https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?city=S%E3o%20Paulo&state=SP&tp=ord&page_this=1&block=0&year1=2012&year2=2014&ementaouintegra=naementa&wordkey=&&camara=1"]

        # Follow the previous/next pagination links and parse every result page.
        rules = (Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="pages_ant_prox"]',)),
                      callback="parse_items", follow=True),)

        def parse_items(self, response):
            hxs = HtmlXPathSelector(response)
            leis = hxs.select('//div[@id="law_text"]')
            items = []
            for lei in leis:  # note: do not shadow the list ("for leis in leis")
                item = LeiMunicipalItem()
                item["numero"] = lei.select("a/b/text()").extract()[0].encode("utf-8")
                item["descricao"] = lei.select("a/div/text()").extract()[0].encode("utf-8")
                item["url"] = lei.select("a/@href").extract()[0].encode("utf-8")
                items.append(item)
            return items
When I execute the above code, I get the following DEBUG warnings:
    2014-07-03 17:15:01-0300 [leis] DEBUG: Redirecting (meta refresh) to <GET https://www.leismunicipais.com.br> from <GET https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?block=10&camara=1&city=S%C3%A3o+Paulo&ementaouintegra=naementa&page_this=2&state=SP&tp=ord&wordkey=&year1=2012&year2=2014>
    2014-07-03 17:15:01-0300 [leis] DEBUG: Redirecting (meta refresh) to <GET https://www.leismunicipais.com.br> from <GET https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?block=490&camara=1&city=S%C3%A3o+Paulo&ementaouintegra=naementa&page_this=50&state=SP&tp=ord&wordkey=&year1=2012&year2=2014>
    2014-07-03 17:15:02-0300 [leis] DEBUG: Crawled (200) <GET https://www.leismunicipais.com.br> (referer: https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?city=S%E3o%20Paulo&state=SP&tp=ord&page_this=1&block=0&year1=2012&year2=2014&ementaouintegra=naementa&wordkey=&&camara=1)
    2014-07-03 17:15:02-0300 [leis] INFO: Closing spider (finished)
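For context, a meta refresh is just an HTML tag that Scrapy's MetaRefreshMiddleware follows automatically, producing the "Redirecting (meta refresh)" lines above. A minimal sketch of what such a tag looks like and how the target URL can be pulled out of it (the HTML and the regex here are my own illustration, not the site's actual markup or Scrapy's internals):

```python
import re

# Hypothetical example of the kind of tag that triggers Scrapy's
# "Redirecting (meta refresh)" debug message.
html = '<meta http-equiv="refresh" content="0; url=https://www.leismunicipais.com.br">'

# Pull the redirect target out of the content attribute.
match = re.search(r'url=([^">]+)', html, re.IGNORECASE)
if match:
    print(match.group(1))  # https://www.leismunicipais.com.br
```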
The first suggestion was to test the xpaths.

From the scrapy shell:
scrapy shell "https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?city=S%E3o%20Paulo&state=SP&tp=ord&page_this=1&block=0&year1=2012&year2=2014&ementaouintegra=naementa&wordkey=&&camara=1"
Testing the first selector:
hxs.select('//a[@class="pages_ant_prox"]/text()')
yields:
    <HtmlXPathSelector xpath='//a[@class="pages_ant_prox"]' data=u'<a href="topsearch.pl?city=S%C3%A3o%20Pa'>
    <HtmlXPathSelector xpath='//a[@class="pages_ant_prox"]' data=u'<a href="topsearch.pl?city=S%C3%A3o%20Pa'>
    <HtmlXPathSelector xpath='//a[@class="pages_ant_prox"]' data=u'<a href="topsearch.pl?city=S%C3%A3o%20Pa'>
    <HtmlXPathSelector xpath='//a[@class="pages_ant_prox"]' data=u'<a href="topsearch.pl?city=S%C3%A3o%20Pa'>
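The extracted hrefs are relative, so Scrapy resolves them against the page URL before requesting them. The same resolution can be sketched with the standard library (the query strings below are shortened stand-ins for illustration, not the real ones):

```python
from urllib.parse import urljoin

# Shortened stand-in for the result-page URL (illustrative only).
base = "https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?page_this=1"
relative = "topsearch.pl?page_this=2"  # illustrative relative pagination link

# urljoin replaces the last path segment, just as a browser would.
print(urljoin(base, relative))
# https://www.leismunicipais.com.br/cgi-local/forpgs/topsearch.pl?page_this=2
```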
Testing one of the second set of selectors:
hxs.select('//div[@id="law_text"]/a/b/text()')
yields:
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16010/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16009/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16008/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16007/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16006/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16005/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16004/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16003/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16002/2014'>
    <HtmlXPathSelector xpath='//div[@id="law_text"]/a/b/text()' data=u'LEI ORDIN\xc1RIA N\xba:16001/2014'>
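As a side note, once extraction works, those titles can be split into type, number and year with a regular expression. A sketch, assuming the 'LEI ORDINÁRIA Nº:16010/2014' layout seen above holds for every row (the pattern and group names are my own, not part of the spider):

```python
import re

title = u'LEI ORDIN\xc1RIA N\xba:16010/2014'  # one of the extracted titles

# Split "TYPE Nº:NUMBER/YEAR" into named groups (\xba is the º character).
match = re.match(r'(?P<tipo>.+?)\s+N\xba:(?P<numero>\d+)/(?P<ano>\d{4})$', title)
if match:
    print(match.group("numero"), match.group("ano"))  # 16010 2014
```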
My question is: why is the start URL being re-encoded incorrectly, so that Scrapy does not crawl and parse the desired information, and how can I correct this? From my own troubleshooting the problem seems to be specific to the "%E3" character (the ã in São Paulo).

After testing the selectors in the shell, I am even more confused as to why the current script does not parse/display the desired results.
Answer 0 (score: 0)
This is not a URL problem. The site performs a meta refresh, which means it may well have a variety of anti-scraping mechanisms in place.
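If you want to inspect what is actually behind those redirects, one option (a sketch, assuming an otherwise default Scrapy project) is to switch off the middleware that follows meta refreshes, so the spider receives the original response instead of the redirected one:

```python
# settings.py (fragment)

# Stop MetaRefreshMiddleware from silently following <meta refresh>
# redirects, so callbacks see the raw response and you can debug it.
METAREFRESH_ENABLED = False
```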