How can I stop it from recording the same URL more than once?
Here is my code so far:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url'] + "\n")
Right now it records thousands of duplicates of a single link, for example on a vBulletin forum with around 250,000 posts.
Edit: Note that the crawler will encounter millions of links, so the duplicate check needs to be fast.
Answer 0 (score: 2)
Keep a list of the URLs you have already visited and check every new URL against it: after parsing a URL, add it to the list; before visiting a newly found URL, check whether it is already in the list, and either parse-and-add it or skip it.
I.e.:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = []  # list with your URLs
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # check if it's already parsed (compare URL strings, not Link objects)
                self.items.append(link.url)  # add to list if it's not parsed yet
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")
Dictionary version (a membership test on a dict's keys is average O(1), so it stays fast even with millions of URLs, whereas the list version has to scan the whole list on every check):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = {}  # dictionary with your URLs as keys
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # check if it's already parsed
                self.items[link.url] = 1  # add as key if it's not parsed yet (the stored value can be anything)
                # do your job on adding it to a file
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")
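As a side note not in the original answer: in Python a set is the idiomatic container for this kind of membership test, giving the same average O(1) lookup as the dict without the dummy values. A minimal sketch using a set (the seen attribute name is my own, not from the answer):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    seen = set()  # hypothetical name: URLs already recorded, average O(1) membership test
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.seen:
                self.seen.add(link.url)
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")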
P.S. You could also collect the items first and only write them to the file at the end.
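A minimal sketch of that approach (my own, not from the answer), reusing the hypothetical seen set from the previous sketch; Scrapy calls a spider's closed() method when the crawl finishes, so the file is written once at the end instead of during the crawl:

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            self.seen.add(link.url)  # a set silently ignores duplicates

    def closed(self, reason):
        # called by Scrapy when the spider finishes: write all URLs in one pass
        with open("items.txt", "w") as f:
            for url in self.seen:
                f.write(url + "\n")

Note that a set does not preserve discovery order; keep the list or dict version if the order of URLs in the file matters.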
There are many other improvements to be made to this code, but I leave those to you as an exercise.