I am new to Python and therefore to Scrapy (a tool for scraping websites, written in Python...), and I hope someone can shed some light on my path... I have just written a spider containing two parsing functions:
- the first parsing function parses the start page I am crawling, which contains chapters and sub-chapters nested over 7 levels, with chapters at the various levels pointing to an article or to a list of articles;
- the second parsing function parses an article or a list of articles, and is invoked as the callback of scrapy.Request(...).
The purpose of this spider is to build one big DOM containing the whole hierarchy of chapters, sub-chapters and articles, together with their content.
I am running into a problem in the second function: the response it receives sometimes does not correspond to the content at the URL that was passed to scrapy.Request. The problem disappears when CONCURRENT_REQUESTS is set to 1. I initially thought it was some multi-threading / non-reentrant-function problem, but I found no reentrancy issue on my side, and I later read that Scrapy is not actually multi-threaded... so I cannot figure out where my problem comes from.
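One thing I want to rule out first is redirects. This is just a small diagnostic sketch of mine (placeholder URL, not my real site): if I read the docs correctly, Scrapy's RedirectMiddleware records the chain of URLs it followed in response.meta['redirect_urls'], so logging that key in the callback shows whether the page that came back was reached through a redirect.

import scrapy

class RedirectCheckSpider(scrapy.Spider):
    name = 'redirectcheck'
    start_urls = ['http://example.com/']   # placeholder URL

    def parse(self, response):
        # 'redirect_urls' is set by RedirectMiddleware and lists every URL
        # the request went through; it is absent when nothing was redirected.
        if response.meta.get('redirect_urls'):
            self.logger.warning('redirected: %s -> %s',
                                response.meta['redirect_urls'], response.url)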
Here is a snippet of my code:
#---------------------------------------------
# Init part:
#---------------------------------------------
import scrapy
from scrapy import signals
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from scrapy.exceptions import CloseSpider

top = Element('top')
curChild = top

class mytest(scrapy.Spider):
    name = 'lfb'
    #
    # This is what makes my code work, but I don't know why!!!
    # Ideally I would like to benefit from the speed of having several
    # concurrent requests when crawling & parsing.
    #
    custom_settings = {
        'CONCURRENT_REQUESTS': '1',
    }

    #
    # This section is just here to be able to do something when the spider
    # closes. In this case I want to print the DOM I have created.
    #
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(mytest, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        print "Spider closed - !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
        # Print the DOM created during the crawl
        print tostring(top)

    def start_requests(self):
        level = 0
        print "Start parsing legifrance level set to %d" % level
        # Print the DOM, which is empty at this point (just the top element)
        print tostring(top)
        yield scrapy.Request("<Home Page>", callback=self.parse)
    #----------------------------------------------
    # First parsing function - parses the home page; this one works fine (I think)
    #----------------------------------------------
    def parse(self, response):
        for sel in response.xpath('//span'):
            cl = sel.xpath("@class").extract()
            desc = sel.xpath('text()').extract()
            #
            # Do some stuff here depending on the class (cl) of 'span', which
            # corresponds either to one of the 7 levels of chapters &
            # sub-chapters or to a list of articles attached to a sub-chapter.
            # To simplify, only the code handling lists of articles
            # (cl == codeLienArt) is shown here.
            # ...
            # ...
            if cl == [unicode('codeLienArt')]:
                art_plink = sel.css('a::attr("href")').extract()
                artLink = "<Base URL>" + str(unicode(art_plink[0]))
                #
                # curChild points to the element in the DOM to which the list
                # of articles should be attached. Pass it in the request meta,
                # so that the second parsing function can place the articles
                # and their content at the right place in the DOM.
                #
                thisChild = curChild
                #
                # Debug print - thisChild.text contains the heading of the
                # sub-chapter to which the list of articles processed by
                # parse1 should be attached.
                #
                print "follow link cl:%s art:%s for %s" % (cl, sel.xpath('a/text()').extract(), thisChild.text)
                #
                # Fetch the list of articles at artLink and pass the response
                # to the second parsing function (I know, it's called parse1 :-)
                #
                yield scrapy.Request(artLink, callback=self.parse1, meta={'element': thisChild})
    #-------------------
    # Second parsing function: parses a list of articles & their content.
    # The format is basically one or several articles, each presented
    # (simplified) as:
    #   <div class="Articles">
    #     <div class="titreArt"> Title here </div>
    #     <div class="corpsArt"> Sometimes some text and often a list of
    #       paragraphs <p>sentences</p> </div>
    #   </div>
    #-------------------
    def parse1(self, resp):
        print "enter parse1"
        numberOfArticles = 0
        for selArt in resp.xpath('//div[@class="article"]'):
            #
            # This is where I see the problem when CONCURRENT_REQUESTS > 1:
            # sometimes the response points to a page that is not the page
            # that was requested in the previous parsing function...
            #
            clArt = selArt.xpath('.//div[@class="titreArt"]/text()').extract()
            print clArt
            numberOfArticles += 1
            childArt = SubElement(resp.meta['element'], 'Article')
            childArt.text = str(unicode("%s" % clArt[0]))
            corpsArt = selArt.xpath('.//div[@class="corpsArt"]/text()').extract()
            print "corpsArt=%s" % corpsArt
            # Concatenate the stray text fragments of corpsArt into one <p>
            temp = ''
            for corpsItem in corpsArt:
                if corpsItem != '\n':
                    temp += corpsItem
            if temp != '':
                childCorps = SubElement(childArt, 'p')
                childCorps.text = temp
                print "corpsArt is not empty %s" % temp
            # Then add one <p> per real paragraph of the article body
            for paraArt in selArt.xpath('.//div[@class="corpsArt"]//p/text()').extract():
                childPara = SubElement(childArt, 'p')
                childPara.text = paraArt
                print "childPara.text=%s" % childPara.text
        print "link followed %s (%d)" % (resp.url, numberOfArticles)
        print "leave parse1"
        yield
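For reference, here is a stripped-down, self-contained sketch of the meta hand-off pattern I believe I am using, with my real site replaced by a placeholder URL and a hypothetical 'expected_url' key added purely as a sanity check. Since the meta dict is stored on the Request object itself, the element it carries should always belong to the request that produced the response; comparing the URL saved at request time with response.url flags any redirect that happened in between.

import scrapy
from xml.etree.ElementTree import Element, SubElement

class MetaCheckSpider(scrapy.Spider):
    name = 'metacheck'
    start_urls = ['http://example.com/']   # placeholder URL

    def parse(self, response):
        root = Element('top')
        for href in response.css('a::attr(href)').extract():
            node = SubElement(root, 'chapter')   # one DOM node per link
            url = response.urljoin(href)
            # 'expected_url' is a hypothetical key of mine, recorded only to
            # compare against response.url in the callback
            yield scrapy.Request(url, callback=self.parse_leaf,
                                 meta={'element': node, 'expected_url': url})

    def parse_leaf(self, response):
        if response.url != response.meta['expected_url']:
            self.logger.warning('mismatch: asked for %s, got %s',
                                response.meta['expected_url'], response.url)
        # Attach something from the page to the node created for its request
        response.meta['element'].text = response.url

If this check never fires, I suppose the remaining suspects are the page content itself and Scrapy's duplicate-request filter, which silently drops requests for URLs it has already seen unless dont_filter=True is passed to scrapy.Request.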