Question

So I was looking at this question: Pass extra values along with urls to scrapy spider

It solves part of my problem. I basically have a long array like:

(ID, country, URL) ...

Using start_requests, I managed to pass the ID and country along as item fields, together with the other fields I am parsing.

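mapping = [('001', 'USA', url1), ]  # (ID, country, URL) tuples, etc. etc.

def start_requests(self):
    # Unpack in the same order the tuples are stored: (ID, country, URL)
    for ID, country, url in self.mapping:
        yield Request(url, callback=self.parse_items, meta={'country': country, 'id': ID})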
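
Since the tuple order is easy to mix up when unpacking, here is a minimal sketch of one way to keep the fields straight by giving them names. The `Site` type, the `request_args` helper, and the example.com URLs are hypothetical placeholders, not part of the original spider; the helper only builds the `(url, meta)` pairs that `Request(url, meta=...)` would receive.

```python
from collections import namedtuple

# Hypothetical record type; field names mirror the (ID, country, URL) layout.
Site = namedtuple("Site", ["id", "country", "url"])

# Placeholder data standing in for the real mapping.
mapping = [
    Site("001", "USA", "https://example.com/us"),
    Site("002", "CAN", "https://example.com/ca"),
]


def request_args(sites):
    """Yield (url, meta) pairs, reading fields by name instead of by position."""
    for site in sites:
        yield site.url, {"id": site.id, "country": site.country}
```

Inside a spider, each pair could then be passed on as `Request(url, meta=meta)`.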

However, when I use CrawlSpider, my code does not work. It skips the rules and never parses anything beyond the first page.

rules = (Rule(LxmlLinkExtractor(restrict_xpaths='//[@name="&lid=pagination-next"]'), callback="parse_items", follow=True),)

def parse_start_url(self, response):
    return self.parse_items(response)

def parse_items(self, response):
    country = response.meta['country']
    id = response.meta['id']

    items = []
    item = {}  # a plain dict here; a scrapy.Item subclass would also work
    item['country'] = country
    item['id'] = id
    item['item1'] = response.xpath(...).extract()
    item['item2'] = response.xpath(...).extract()

...
...

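My question is: how do I keep the extra ID from my initial mapping tuples and still have the CrawlSpider rules followed to crawl through these sites? It scrapes the FIRST page fine, but it does not crawl any further.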

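I should also say that without start_requests, using only start_urls, the crawl works without problems, but of course I cannot for the life of me figure out how to tag the 'country' and 'id' associated with those URLs while crawling.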

How do I pass extra parameters/values along with the start_url for use in a CrawlSpider?

0 Answers: