I have scraped the URLs I want from the page. Now I want to use a pipeline to filter them for keywords:
class GumtreeCouchesPipeline(object):
    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if any(key in item['url'] for key in keywords):
            return item
The problem is that it now returns nothing.
The spider:
import scrapy
from gumtree_couches.items import adItem
from urllib.parse import urljoin


class GumtreeSpider(scrapy.Spider):
    name = 'GumtreeCouches'
    allowed_domains = ['https://someurl']
    start_urls = ['https://someurl']

    def parse(self, response):
        item = adItem()
        for ad_links in response.xpath('//div[@class="view"][1]//a'):
            relative_url = ad_links.xpath('@href').extract_first()
            item['title'] = ad_links.xpath('text()').extract_first()
            item['url'] = response.urljoin(relative_url)
            yield item
How can I use the pipeline to filter all the scraped URLs for the keywords? Thanks!
Answer 0 (score: 1)
This should fix your problem:
class GumtreeCouchesPipeline(object):
    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if any(key in item['url'] for key in self.keywords):
            return item
Note that I'm using self.keywords to refer to the keywords class attribute.
If you look at your spider logs, you should find some errors like: NameError: name 'keywords' is not defined.
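As a side illustration (not part of the original answer), a minimal standalone sketch of why the bare name fails while the self. lookup works:

class Example(object):
    keywords = ['leather', 'couches']

    def broken(self):
        # Inside a method, a bare name is resolved against the local and
        # global scopes only, not the class body, so calling this raises
        # NameError: name 'keywords' is not defined.
        return keywords

    def fixed(self):
        # Class attributes are reachable through the instance (or the class).
        return self.keywords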
In any case, I'd suggest implementing this pipeline like this:
from scrapy.exceptions import DropItem


class GumtreeCouchesPipeline(object):
    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if not any(key in item['url'] for key in self.keywords):
            raise DropItem('missing keyword in URL')
        return item
This way, once the job is finished you'll also get information about the dropped items in the job stats.
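For the pipeline to run at all, it also needs to be enabled in the project settings. A minimal sketch, assuming the pipeline is defined in gumtree_couches/pipelines.py (the module path is an assumption based on the project name; adjust it to match your layout):

# settings.py
ITEM_PIPELINES = {
    # assumed module path; the number is the pipeline's execution order
    'gumtree_couches.pipelines.GumtreeCouchesPipeline': 300,
}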
Answer 1 (score: 0)
From reading the docs, I think you have to cover all the paths, something like:
from scrapy.exceptions import DropItem


def process_item(self, item, spider):
    keywords = ['leather', 'couches']
    if item['url']:
        if any(key in item['url'] for key in keywords):
            return item
        else:
            raise DropItem("Missing specified keywords.")
    else:
        return item