Filtering items in a Scrapy pipeline

Time: 2018-10-09 13:06:20

Tags: python web-scraping scrapy

I have scraped the URLs I want from the page. Now I want to use a pipeline to filter them by keyword:

class GumtreeCouchesPipeline(object):

    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if any(key in item['url'] for key in keywords):
            return item

The problem is that it now returns nothing.

Spider:

import scrapy
from gumtree_couches.items import adItem
from urllib.parse import urljoin

class GumtreeSpider(scrapy.Spider):
    name = 'GumtreeCouches'
    allowed_domains = ['https://someurl']
    start_urls = ['https://someurl']


    def parse(self, response):
        item = adItem()
        for ad_links in response.xpath('//div[@class="view"][1]//a'):
            relative_url = ad_links.xpath('@href').extract_first()
            item['title'] = ad_links.xpath('text()').extract_first()
            item['url'] = response.urljoin(relative_url)

            yield item

How can I use a pipeline to filter all of the scraped URLs by keyword? Thanks!

2 answers:

Answer 0 (score: 1)

This should solve your problem:

class GumtreeCouchesPipeline(object):

    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if any(key in item['url'] for key in self.keywords):
            return item

Note that I am using self.keywords to refer to the keywords class attribute.

If you look at the spider log, you should find errors such as: NameError: name 'keywords' is not defined
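The reason for that NameError can be reproduced without Scrapy. The following is a minimal sketch, assuming two hypothetical classes (Broken and Fixed) and a made-up URL, none of which come from the question. It shows that names assigned in a class body are not in scope inside the class's methods, so they have to be reached through self (or the class name):

```python
class Broken:
    keywords = ['leather', 'couches']

    def matches(self, url):
        # Class-body names are not visible inside methods, so the bare name
        # `keywords` is looked up as a global and is not found.
        return any(key in url for key in keywords)


class Fixed:
    keywords = ['leather', 'couches']

    def matches(self, url):
        # Attribute lookup on the instance falls back to the class attribute.
        return any(key in url for key in self.keywords)


url = 'https://example.com/ads/leather-sofa'  # hypothetical URL

try:
    Broken().matches(url)
    error = None
except NameError as exc:
    error = str(exc)

print(error)
print(Fixed().matches(url))
```

In a crawl, Scrapy logs the exception raised inside process_item instead of stopping, which is why the only visible symptom is that no items come out.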

In any case, I suggest you implement the pipeline like this:

from scrapy.exceptions import DropItem

class GumtreeCouchesPipeline(object):

    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if not any(key in item['url'] for key in self.keywords):
            raise DropItem('missing keyword in URL')
        return item

This way, when the job finishes, you will get information about the dropped items in the job stats.
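To check the filtering logic outside a crawl, the pipeline can be driven by hand. This is only a sketch under a few assumptions: plain dicts stand in for adItem, the sample URLs and the run_pipeline helper are hypothetical, and a stand-in DropItem is defined if Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs without Scrapy installed (assumption).
    class DropItem(Exception):
        pass


class GumtreeCouchesPipeline(object):

    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if not any(key in item['url'] for key in self.keywords):
            raise DropItem('missing keyword in URL')
        return item


def run_pipeline(items):
    """Feed items through the pipeline and return (kept, dropped)."""
    pipeline = GumtreeCouchesPipeline()
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item, spider=None))
        except DropItem:
            dropped.append(item)
    return kept, dropped


# Hypothetical sample items; plain dicts stand in for adItem here.
sample = [
    {'title': 'Leather couch', 'url': 'https://example.com/ads/leather-couch'},
    {'title': 'Oak table', 'url': 'https://example.com/ads/oak-table'},
]
kept, dropped = run_pipeline(sample)
print(len(kept), len(dropped))
```

In a real project the pipeline only runs if it is listed in the ITEM_PIPELINES setting. Scrapy's core stats then count each DropItem under item_dropped_count in the end-of-job stats dump.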

Answer 1 (score: 0)

From reading the documentation, I think you have to handle every code path, for example:

from scrapy.exceptions import DropItem

class GumtreeCouchesPipeline(object):

    def process_item(self, item, spider):
        keywords = ['leather', 'couches']
        # Items with an empty URL are passed through unfiltered.
        if item['url']:
            if any(key in item['url'] for key in keywords):
                return item
            else:
                raise DropItem("Missing specified keywords.")
        else:
            return item