Is there a way in Scrapy to access the current request queue, or to build it dynamically?
My use case: I have a product listing page with several filters. Each filter produces a different URL (because the query strings differ), but the products actually listed are the same. So what I'd like to do is first collect all product URLs across the various filters, then remove the duplicates at the spider level itself, attaching metadata to each request recording which filters produced the same product.
This way I can identify duplicates without actually parsing the product pages first, and I can also associate the filter metadata with the items.
Pseudocode for what I'd like to achieve:
    def parse(self, response):
        filter_pages = <for every filter link present on the page, download/request the filtered page>
        all_urls = []
        for filter in filter_pages:
            url = <scrape the URLs of all product/item links>
            all_urls.append(url)
        for url in all_urls:
            filter_tags = <identify duplicates in all_urls (based on the product id in the query string) and return a CSV of the filters that produced duplicates>
            request = Request(url, callback=self.parseItem)
            request.meta['tag'] = filter_tags
            yield request
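The de-duplication step above can be sketched with the standard library alone. This is only a sketch of the grouping logic, not Scrapy API: the query-string parameter name `pid` and the helper name `group_by_product` are assumptions you would adjust to the real site.

```python
from urllib.parse import urlparse, parse_qs

def group_by_product(urls_by_filter, id_param="pid"):
    """Group product URLs by the product id found in the query string.

    urls_by_filter maps a filter name to the product URLs it produced.
    Returns {product_id: (canonical_url, "csv,of,filters")}, so each
    product is requested only once, tagged with every filter that
    listed it. (`pid` as the id parameter is an assumption.)
    """
    grouped = {}
    for filter_name, urls in urls_by_filter.items():
        for url in urls:
            qs = parse_qs(urlparse(url).query)
            # Fall back to the full URL when no product id is present.
            pid = qs.get(id_param, [url])[0]
            canonical, filters = grouped.get(pid, (url, []))
            filters.append(filter_name)
            grouped[pid] = (canonical, filters)
    return {pid: (url, ",".join(f)) for pid, (url, f) in grouped.items()}

# Example: two filters list the same product id 42.
urls = {
    "color=red": ["http://example.com/item?pid=42&color=red"],
    "size=xl": ["http://example.com/item?pid=42&size=xl",
                "http://example.com/item?pid=7&size=xl"],
}
result = group_by_product(urls)
print(result["42"][1])  # filters that produced product 42
```

In the spider, `parse` would yield one `Request` per key of the returned dict, putting the CSV filter string into `request.meta['tag']`; Scrapy's built-in dupefilter would otherwise treat each filtered URL as distinct, since the query strings differ.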