Question

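I'm new to Python and Scrapy and would like to understand the right approach. I've worked through the official Scrapy tutorial, but it only covers a basic example. The requirement I describe below is different, though only slightly more complex.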

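There is a site that displays items from a database. For each item, I need to get attributes from both the individual item page and the search results (listing) page. The search results page URL has the format: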

    http://example.com/search?&start_index=0

Changing start_index changes the starting position of the results. Each results page shows only 10 records.
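As a rough illustration of this paging scheme, the results-page URLs needed for a given number of records can be computed up front. This is only a sketch: the helper name `search_page_urls` and its parameters are made up for the example, and it assumes every page holds exactly 10 records.

```python
import math

SEARCH_URL = "http://example.com/search?&start_index={}"
RESULTS_PER_PAGE = 10  # each results page shows 10 records


def search_page_urls(first_index, num_records, per_page=RESULTS_PER_PAGE):
    """Return the results-page URLs that cover num_records records.

    Hypothetical helper: first_index is the start_index of the first page.
    """
    pages = math.ceil(num_records / per_page)
    return [SEARCH_URL.format(first_index + page * per_page) for page in range(pages)]


# 16 records starting at index 0 need two results pages.
for url in search_page_urls(0, 16):
    print(url)
```

Knowing the page count in advance means the crawl has a natural end and does not depend on a stop flag.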

Results are displayed in table cells in the format:

    link | Desc. | Status

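I need to retrieve the Desc. and Status attributes, then follow the link to a page with more details, from which I'll also retrieve item fields. I want to retrieve a given number of records starting from any start index. My current Scrapy approach is shown below (abbreviated for brevity):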

import scrapy

from scrapy.exceptions import CloseSpider
from cbury_scrapy.items import MyItem

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://example.com/cgi/search?&start_index=",
    ]

    url_index = 0
    URLS_PER_PAGE = 10
    records_remaining = 16
    crawl_done = False

    da = MyItem()        

    def parse(self, response):
        while self.crawl_done != True:
            url = "http://example.com/cgi/search?&start_index=" + str(self.url_index)
            yield scrapy.Request(url, callback=self.parse_results)
            self.url_index += self.URLS_PER_PAGE


    def parse_results(self, response):
        # Retrieve all table rows from results page
        for row in response.xpath('//table/tr[@class="datrack_resultrow_odd" or @class="datrack_resultrow_even"]'):
            # extract the Description and Status fields

            # extract the link to Item page
            url = r.xpath('//td[@class="datrack_danumber_cell"]//@href').extract_first()
            yield scrapy.Request(url, callback=self.parse_item)

            if self.records_remaining == 0:
                self.crawl_done = True
                raise CloseSpider('Finished scrape of requested number of records.')

            self.records_remaining -= 1

    def parse_item(self, response):
        # get fields from item page
        # ...   
        yield self.item

When records_remaining reaches 0, the code currently doesn't stop, even after the CloseSpider exception is raised, which is a bug.

I suspect this stems from the way the parse methods are arranged. What is the correct, "Scrapy" way to structure this? Any help is appreciated.

Answer 1

def parse(self, response):
    # Placeholder XPath: it should select the href of every index link.
    index_links = response.xpath('place xpath here that leads to a list of urls for indexes').extract()
    for link in index_links:
        # The hrefs may be relative, e.g. '/extension/for/index1';
        # response.urljoin() resolves them against the current page URL.
        yield scrapy.Request(response.urljoin(link), callback=self.parse_indexes)

def parse_indexes(self, response):
    # Create a fresh item for every response instead of sharing one on the class.
    da = MyItem()
    da['record_date'] = response.xpath('xpath_here').extract_first()
    da['record_summary'] = response.xpath('xpath_here').extract_first()
    da['additional_record_info'] = response.xpath('xpath_here').extract_first()
    yield da

This example is oversimplified, but I hope it helps.

You want to instantiate your item, da = MyItem(), inside the parse method itself rather than on the class.
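To see why this matters, here is a small Scrapy-free sketch. The two classes are hypothetical stand-ins for a spider, and a plain dict stands in for MyItem. An item stored on the class is one shared object, so every callback overwrites the previous result.

```python
class SharedItemSpider:
    # One dict shared by every call, like `da = MyItem()` at class level.
    item = {}

    def parse(self, value):
        self.item["value"] = value
        return self.item


class FreshItemSpider:
    def parse(self, value):
        # A new dict per call, like `da = MyItem()` inside the callback.
        item = {}
        item["value"] = value
        return item


shared = SharedItemSpider()
first, second = shared.parse(1), shared.parse(2)
print(first is second, first)  # same object, so the first result was overwritten

fresh = FreshItemSpider()
print(fresh.parse(1), fresh.parse(2))  # two independent results
```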

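To answer the bigger question about the parse flow, I would start with the URLs. Once you have found the XPath for the indexes from the start_url, you can use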

scrapy.Request(url=index_url, callback=self.parse_indexes)

This directs your spider to the next parse method, parse_indexes.

index_url is obtained by iterating over the results of the relevant XPath.

parse_indexes works just like parse, except that it extracts information from the_next_index_url.

If this answer is heading in the right direction, I can post an example later.

Scrapy general parsing workflow
