I have built a crawler that crawls within a fixed domain and extracts URLs matching a fixed regular expression. The crawler follows a link when it sees a specific kind of URL. It extracts the URLs perfectly, but every time I run it, it returns a different number of links, i.e. the link count varies from run to run. I am crawling with Scrapy. Is this some issue with Scrapy? The code is:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    rules = (Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
             Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),)

    def parse_item(self, response):
        outputfile = open('urllist.txt', 'a')
        print response.url
        outputfile.write(response.url + '\n')
Answer 0 (score: 1)
Instead of opening a file in append ('a') mode and writing the links by hand inside the parse_item() method, use Scrapy's built-in item exporters. Define an item with a url field:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    # First rule scrapes vacancy detail pages; second rule follows pagination links.
    rules = (Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
             Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),)

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        yield item
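With the items yielded, Scrapy's feed exports can serialize the collected URLs for you. A minimal invocation, run from the project directory (the output file name is arbitrary):

    scrapy crawl xyz -o urllist.json -t json

Scrapy opens and closes the output file itself, writing every yielded item, so you get a complete, well-formed file per crawl rather than an ever-growing append-mode log.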