I have a very large site with a lot of URLs that I want to crawl. Is there a way to tell Scrapy to ignore a list of URLs?
Right now I store all of the URLs in a database column, and I would like to be able to restart the spider and pass it that long list (24k rows) so that it knows to skip everything it has already seen.
Is there any way to do this?
from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor()

class MySpider(Spider):
    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
        'DOWNLOAD_DELAY': 1.5,
        'DEPTH_LIMIT': 0,
        'JOBDIR': 'jobs/scrapy_1'
    }
    name = None
    allowed_domains = []
    start_urls = []

    def parse(self, response):
        # follow every link extracted from the page
        for link in le.extract_links(response):
            yield response.follow(link.url, self.parse)
Answer 0 (score: 1)
You will have to store the scraped URLs somewhere; I usually handle this in MySQL, and then when I restart the scraper I ignore them like this:
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor()

class YourSpider(scrapy.Spider):
    def parse(self, response):
        # assumes `cursor` is an open MySQL cursor returning rows as dicts (e.g. a DictCursor)
        cursor.execute("SELECT url FROM table")
        already_scraped = set(a['url'] for a in cursor.fetchall())
        for link in le.extract_links(response):
            if link.url not in already_scraped:
                yield Request(link.url, callback=self.parse)
            else:
                self.logger.error("%s is already scraped" % link.url)
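A note on the approach above, not part of the original answer: running the SELECT inside parse() repeats the query for every response. A sketch of the same idea that loads the already-scraped URLs once at spider start-up and keeps them in a set is shown below; the pymysql connection settings, table name, and column name are placeholders you would adapt to your own database.
import pymysql
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # placeholder connection settings -- adjust to your own database
        conn = pymysql.connect(host="localhost", user="user",
                               password="pass", db="your_db")
        with conn.cursor() as cursor:
            cursor.execute("SELECT url FROM table")
            # one query at start-up; a set gives O(1) membership checks for 24k URLs
            self.already_scraped = {row[0] for row in cursor.fetchall()}
        conn.close()
        self.le = LinkExtractor()

    def parse(self, response):
        for link in self.le.extract_links(response):
            if link.url not in self.already_scraped:
                yield Request(link.url, callback=self.parse)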
Answer 1 (score: 0)
Check the information in the database:
import scrapy
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor()

class YourSpider(scrapy.Spider):

    def check_duplicate_post_links(self, links):
        # keep only the links whose URL is not already stored in the database
        new_links = []
        for link in links:
            sql = 'SELECT id FROM your_table WHERE url = %s'
            # assumes `self.cursor` is an open database cursor
            self.cursor.execute(sql, (link.url,))
            duplicate_db = self.cursor.fetchall()
            if duplicate_db:
                self.logger.error("error url duplicated: {}".format(link.url))
            else:
                new_links.append(link)
        return new_links

    def parse(self, response):
        links = le.extract_links(response)
        new_links = self.check_duplicate_post_links(links)
        for link in new_links:
            # add your information here; assumes YourScrapyItem is defined in items.py
            item = YourScrapyItem()
            item['url'] = link.url
            yield item
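A possible refinement, not part of the original answer: the method above issues one SELECT per link, which adds up on pages with many links. The sketch below checks the whole batch in a single query; it assumes the same self.cursor and your_table placeholders as the answer, and a default (tuple-returning) cursor.
    def check_duplicate_post_links(self, links):
        # look up all extracted URLs in one round trip instead of one query per link
        urls = [link.url for link in links]
        if not urls:
            return []
        placeholders = ', '.join(['%s'] * len(urls))
        sql = 'SELECT url FROM your_table WHERE url IN ({})'.format(placeholders)
        self.cursor.execute(sql, urls)
        seen = {row[0] for row in self.cursor.fetchall()}
        return [link for link in links if link.url not in seen]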