How can I stop it from recording the same URL more than once?
Here is my code so far:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url'] + "\n")
Right now it records thousands of duplicates of a single link, for example on a vBulletin forum with around 250,000 posts.
Edit: Note that the crawler will encounter millions of links, so the duplicate check needs to be fast.
Answer 0 (score: 2)
Keep a list of the URLs you have already visited and check every new URL against it: after parsing a URL, add it to the list; before visiting a newly found URL, check whether it is already in the list, and either parse-and-add it or skip it.
I.e.:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = []  # list with your URLs
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # check if it's already parsed (compare URL strings, not Link objects)
                self.items.append(link.url)  # add to list if it's not parsed yet
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")
Dictionary version (a membership test on a dict's keys is average O(1), so it stays fast even with millions of URLs, whereas the list version has to scan the whole list on every check):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = {}  # dictionary with your URLs as keys
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # check if it's already parsed
                self.items[link.url] = 1  # add as key if it's not parsed yet (the stored value can be anything)
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")
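As a side note not in the original answer: in Python a set is the idiomatic container for this kind of membership test, giving the same average O(1) lookup as the dict without the dummy values. A minimal sketch using a set (the seen attribute name is my own, not from the answer):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    seen = set()  # hypothetical name: URLs already recorded, average O(1) membership test
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.seen:
                self.seen.add(link.url)
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")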
P.S. You could also collect the items first and only write them to the file at the end.
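A minimal sketch of that approach (my own, not from the answer), reusing the hypothetical seen set from the previous sketch; Scrapy calls a spider's closed() method when the crawl finishes, so the file is written once at the end instead of during the crawl:

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            self.seen.add(link.url)  # a set silently ignores duplicates

    def closed(self, reason):
        # called by Scrapy when the spider finishes: write all URLs in one pass
        with open("items.txt", "w") as f:
            for url in self.seen:
                f.write(url + "\n")

Note that a set does not preserve discovery order; keep the list or dict version if the order of URLs in the file matters.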
There are many other improvements to be made to this code, but I leave those to you as an exercise.