I created a spider using Scrapy. The spider crawls a website and collects URLs.

Technologies used: Python, Scrapy

Problem: I am getting duplicate URLs.

Output I need: I want the spider to crawl the website and collect the URLs, but without crawling duplicate URLs.

Example code: I added this line to my settings.py file:

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'

When I ran the spider, it said the module could not be found.
import os

import scrapy
import scrapy.dupefilters


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'

    # this is equivalent to what you would set in the settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'inputLinks2.csv'
    }

    # delete any feed file left over from a previous run
    filePath = 'inputLinks2.csv'
    if os.path.exists(filePath):
        os.remove(filePath)
    else:
        print("Cannot delete the file as it doesn't exist")

    start_urls = ['https://www.mytravelexp.com/']

    def parse(self, response):
        # yield every link found on the page as an item
        titles = response.xpath("//a/@href").extract()
        for title in titles:
            yield {'title': title}

    # NOTE: the two methods below are dupefilter methods (compare
    # scrapy.dupefilters.RFPDupeFilter); Scrapy never calls them on a
    # Spider, and a Spider has no 'fingerprints' or 'file' attributes,
    # so as written here they have no effect
    def __getid(self, url):
        mm = url.split("&refer")[0]  # or something like that
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
Please help!!
Answer 0 (score: 0)
Scrapy already filters duplicate requests by default, through its built-in RFPDupeFilter. The "module not found" error comes from the setting itself: since Scrapy 1.0 the module is named scrapy.dupefilters (plural), so the default setting is DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'. Also note that the dupefilter deduplicates requests, not scraped items, so it will not remove repeated hrefs that the spider yields from a single page.
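For reference, here is a minimal sketch of a URL-based dupefilter, assuming Scrapy 1.0 or later. The URLDupeFilter class, the myproject.dupefilters_custom module path, and the _url_key helper are illustrative names introduced here, not part of Scrapy's API; the "&refer" split is carried over from the question.

# dupefilters_custom.py -- hypothetical module inside your project
import os

from scrapy.dupefilters import RFPDupeFilter


class URLDupeFilter(RFPDupeFilter):
    """Treat two request URLs as duplicates if they match up to '&refer'."""

    def _url_key(self, url):
        # drop the tracking suffix, as in the question's __getid
        return url.split("&refer")[0]

    def request_seen(self, request):
        fp = self._url_key(request.url)
        if fp in self.fingerprints:
            return True  # already seen: Scrapy drops this request
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False

Then point settings.py at it:

DUPEFILTER_CLASS = 'myproject.dupefilters_custom.URLDupeFilter'

Because the dupefilter only affects requests, duplicate hrefs can still reach the CSV feed. A simple spider-side guard, sketched under the same assumptions, is to remember which URLs have already been yielded:

import scrapy


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    start_urls = ['https://www.mytravelexp.com/']
    seen_urls = set()  # URLs already yielded to the feed

    def parse(self, response):
        for href in response.xpath("//a/@href").extract():
            if href not in self.seen_urls:
                self.seen_urls.add(href)
                yield {'title': href}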