Scraping with Scrapy when CONCURRENT_REQUESTS > 1

Time: 2017-08-02 21:13:59

Tags: python scrapy scrapy-spider

I am new to Python and hence to Scrapy (a tool, written in Python, for scraping websites...), and I hope someone can shed some light on my path... I have just written a spider containing 2 parse functions:

- a first parse function that parses the start page I am crawling, which contains chapters & 7 levels of sub-chapters; at various levels, chapters point to either an article or a list of articles;
- a second parse function that parses an article or a list of articles, and which is invoked as the callback of scrapy.Request(...).

The purpose of this spider is to build one big DOM containing the chapters, sub-chapters, articles and their entire content.

My problem is in the second function: some of the time, the response it receives does not correspond to the content of the URL that was used in the scrapy.Request call. The problem disappears when CONCURRENT_REQUESTS is set to 1. I initially thought this was due to some multi-threading / non-reentrant function problem, but I found no reentrancy issue in my code, and I later read that Scrapy is in fact not multi-threaded (it is single-threaded on top of the Twisted event loop, so callbacks interleave but never run in parallel)... so I cannot figure out where my problem comes from.
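One thing I still want to rule out is that the "wrong" page is simply the result of an HTTP redirect. As far as I understand, Scrapy's RedirectMiddleware records the original URLs of a redirect chain under the 'redirect_urls' meta key, so a small check in the second parsing function (just a sketch) should show it:

    redirects = resp.meta.get('redirect_urls')
    if redirects:
        self.logger.warning("%s was reached via redirects from %s", resp.url, redirects)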

Here is a snippet of my code:

#---------------------------------------------
# Init part:
#---------------------------------------------
import scrapy
from scrapy import signals
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from scrapy.exceptions import CloseSpider

top = Element('top')
curChild = top

class mytest(scrapy.Spider):
    name = 'lfb'

#
# This is what makes my code work, but I don't know why!!!
# Ideally I would like to benefit from the speed of having several
# concurrent requests when crawling & parsing.
#
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }
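#
# (Aside: if the goal were only to be gentle with the site rather than to
# fully serialize the crawl, Scrapy also has narrower standard settings,
# for example:
#
#     custom_settings = {
#         'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
#         'DOWNLOAD_DELAY': 0.5,
#     }
# )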

#
# This section is just here to be able to do something when the spider closes
# In this case I want to print the DOM I've created.
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(mytest, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        print("Spider closed - !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        # Print the DOM built during the crawl
        print(tostring(top))
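#
# (Note: Scrapy also calls a method named closed(self, reason) on the spider
# automatically when it finishes; this is a documented shortcut for hooking
# the spider_closed signal, so the from_crawler wiring above could be
# replaced with just:
#
#     def closed(self, reason):
#         print(tostring(top))
# )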


    def start_requests(self):
        level = 0
        print "Start parsing legifrance level set to %d" % level
# This is to print the DOM which is empty (or almost - just the top element in there)
        print tostring(top)
        yield scrapy.Request("<Home Page>", callback=self.parse)

#----------------------------------------------
# First parsing function - Parsing the Home page - this one works fine (I think)
#----------------------------------------------
    def parse(self, response):
        for sel in response.xpath('//span'):
            cl = sel.xpath("@class").extract()
            desc = sel.xpath('text()').extract()
#
# Do some stuff here depending on the class (cl) of 'span', which corresponds
# either to one of the 7 levels of chapters & sub-chapters or to a list of
# articles attached to a sub-chapter. To simplify, I'm only including here
# the code that handles a list of articles (cl == codeLienArt).
#           ...
#           ...
            if cl == [unicode('codeLienArt')]:
                art_plink = sel.css('a::attr("href")').extract()
                artLink = "<Base URL>" + str(unicode(art_plink[0]))
#
# curChild points to the element in the DOM to which the list of articles
# should be attached. Pass it in the request meta, in order for the second
# parsing function to place the articles & their content at the right place
# in the DOM
#
                thisChild = curChild
#
# print for debug - thisChild.text contains the heading of the sub-chapter
# to which the list of articles that will be processed by parse1 should be
# attached.
#
            print "follow link cl:%s art:%s for %s" % (cl, sel.xpath('a/text()').extract(), thisChild.text )
#
# get the list of articles following artLink & pass the response to the second parsing function 
# (I know it's called parse1 :-)
#
                yield scrapy.Request(artLink, callback=self.parse1, meta={'element': thisChild})
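#
# Note: each Request carries its own meta dict, so the 'element' stored here
# should come back untouched as resp.meta['element'] in parse1, even with
# several requests in flight.
#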

#-------------------
# This is the second parsing function, which parses a list of articles and
# their content. The format is basically one or several articles, each
# presented (simplified) as:
# <div class="article">
#   <div class="titreArt">Title here</div>
#   <div class="corpsArt">Sometimes some text, and often a list of paragraphs <p>sentences</p></div>
# </div>
#-------------------
    def parse1(self, resp):
        print("enter parse1")
        numberOfArticles = 0
        for selArt in resp.xpath('//div[@class="article"]'):
#
# This is where I see the problem when CONCURRENT_REQUESTS > 1, sometimes
# the response points to a page that is not the page that was requested in
# the previous parsing function...
#
            clArt = selArt.xpath('.//div[@class="titreArt"]/text()').extract()
            print(clArt)
            numberOfArticles += 1
            childArt = SubElement(resp.meta['element'], 'Article')
            childArt.text = str(unicode("%s" % clArt[0]))
            corpsArt = selArt.xpath('.//div[@class="corpsArt"]/text()').extract()
            print("corpsArt=%s" % corpsArt)
            temp = ''
            for corpsItem in corpsArt:
                if corpsItem != '\n':
                    temp += corpsItem

            if temp != '':
                childCorps = SubElement(childArt, 'p')
                childCorps.text = temp
                print("corpsArt is not empty %s" % temp)
            for paraArt in selArt.xpath('.//div[@class="corpsArt"]//p/text()').extract():
                childPara = SubElement(childArt, 'p')
                childPara.text = paraArt
                print("childPara.text=%s" % childPara.text)

    print "link followed %s (%d)" % (resp.url,numberOfArticles)
    print "leave parse1"
    yield
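
For reference, here is the minimal pattern I believe I am following for carrying per-request state through meta (a self-contained sketch with a placeholder URL, not my real spider):

import scrapy

class MetaSketch(scrapy.Spider):
    name = 'meta_sketch'
    start_urls = ['http://example.com/']  # placeholder

    def parse(self, response):
        for i, href in enumerate(response.css('a::attr(href)').extract()):
            # Each Request gets its own meta dict; nothing is shared
            # between concurrent requests.
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_item,
                                 meta={'index': i})

    def parse_item(self, response):
        # The value stored above comes back attached to this response.
        self.logger.info("url=%s index=%d",
                         response.url, response.meta['index'])

If each Request really gets its own meta dict, I cannot see where the cross-talk between concurrent requests would come from.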

0 Answers:

No answers yet