I have built a crawler that crawls within a fixed domain and extracts URLs matching a fixed regular expression. The crawler follows a link when it sees a specific kind of URL. It extracts the URLs perfectly, but every time I run it, it returns a different number of links, i.e. the link count varies from run to run. I am crawling with Scrapy. Is this some issue with Scrapy? The code is:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    rules = (Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
             Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),)

    def parse_item(self, response):
        outputfile = open('urllist.txt', 'a')
        print response.url
        outputfile.write(response.url + '\n')
Answer 0 (score: 1)
Instead of opening a file in append ('a') mode and writing the links by hand inside the parse_item() method, use Scrapy's built-in item exporters. Define an item with a url field:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    # First rule scrapes vacancy detail pages; second rule follows pagination links.
    rules = (Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
             Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),)

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        yield item
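With the items yielded, Scrapy's feed exports can serialize the collected URLs for you. A minimal invocation, run from the project directory (the output file name is arbitrary):

    scrapy crawl xyz -o urllist.json -t json

Scrapy opens and closes the output file itself, writing every yielded item, so you get a complete, well-formed file per crawl rather than an ever-growing append-mode log.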