My loop in Scrapy isn't running in order

Asked: 2013-02-01 13:30:15

Tags: for-loop while-loop range scrapy

I am scraping a series of URLs. The code runs, but Scrapy doesn't parse the URLs in sequential order. For example, even though I'm trying to parse url1, url2, ..., url100, Scrapy parses url2, url10, url1, and so on.

It parses all of the URLs, but when a particular URL doesn't exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure I don't end up with duplicates, I need the loop to parse the URLs in order rather than "at will".

I have tried both "for n in range(1,101)" and a "while bID<100" with the same result. (See below.)

Thanks in advance!

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        bID=0
        #for n in range(1,100,1):
        while bID<100:
            bID=bID+1
            startURL='https://www.example.com/units.aspx?b_id=%d' % (bID)
            request=Request(url=startURL ,dont_filter=True,callback=self.parse_add_tables,meta={'bID':bID,'metaItems':[]})
            # print self.metabID
            yield request #Request(url=startURL ,dont_filter=True,callback=self.parse2)
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

2 Answers:

Answer 0 (score: 0):

You can use the priority attribute on the Request object. Scrapy guarantees that URLs are crawled in DFO (depth-first order) by default, but it does not ensure that the URLs are visited in the order they were yielded from your parse callback.
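For illustration, here is a minimal sketch of the priority idea, reusing the callback name, URL pattern and meta keys from the question. Note that priority only influences scheduling order; with concurrent downloads, responses can still come back slightly out of order:

    from scrapy.http import Request

    def check_login_response(self, response):
        """Sketch: schedule all requests up front, but give earlier b_id
        values a higher priority so the scheduler dispatches them first."""
        if "Welcome!" in response.body:
            self.initialized()
            for bID in range(1, 101):
                yield Request(
                    url='https://www.example.com/units.aspx?b_id=%d' % bID,
                    dont_filter=True,
                    callback=self.parse_add_tables,
                    priority=100 - bID,  # higher priority is scheduled earlier
                    meta={'bID': bID, 'metaItems': []},
                )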

Rather than yielding the Request objects one by one, you want to keep a list of requests and pop objects off it until it is empty.

For more details, see here:

Scrapy Crawl URLs in Order

Answer 1 (score: 0):

You could try something like this. I can't be sure it will work for your purpose, since I haven't seen the rest of the spider code, but here you go:

# list of urls still to be crawled, in reverse order so we can easily pop() the next one;
# b_id=1 is requested from check_login_response, so the list covers b_id 2..100
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n for n in xrange(100, 1, -1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        # kick off the chain with the first b_id; every subsequent request is issued
        # from parse_add_tables, so the pages are fetched one at a time, in order
        return Request(url='https://www.example.com/units.aspx?b_id=1', dont_filter=True,
                       callback=self.parse_add_tables, meta={'bID': 1, 'metaItems': []})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    # parsing code here -- build the list of scraped items for this page
    items = []

    # hand the scraped items back to Scrapy
    for item in items:
        yield item

    # then chain the next request so the pages are visited strictly in b_id order
    if self.crawl_urls:
        next_url = self.crawl_urls.pop()
        # parse the b_id out of the query string (next_url[-1:] would only
        # work for single-digit ids)
        yield Request(url=next_url, dont_filter=True, callback=self.parse_add_tables,
                      meta={'bID': int(next_url.split('=')[-1]), 'metaItems': []})
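A note on the design choice: because only one units.aspx request is in flight at a time with this chained approach, the responses necessarily come back in b_id order; the trade-off is that this part of the crawl is effectively serialized. The priority approach from the other answer keeps concurrency, but only influences the scheduling order, not the order in which responses arrive.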