I don't know what's wrong with this spider, but it won't crawl any pages:
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from paper_crawler.items import PaperCrawlerItem

class PlosGeneticsSpider(CrawlSpider):
    name = 'plosgenetics'
    allowed_domains = ['plosgenetics.org']
    start_urls = ['http://www.plosgenetics.org/article/browse/volume']

    rules = [
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//ul[@id="journal_slides"]')),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        self.log(response.url)
        print response.url
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[@class="item cf"]')
        items = []
        for title in titles:
            item = PaperCrawlerItem()
            item['title'] = "".join(title.xpath('.//div[@class="header"]//h3//a[contains(@href,"article")]/text()').extract()).strip()
            item['URL'] = title.xpath('.//div[@class="header"]//h3//a[contains(@href,"article")]/@href').extract()
            item['authors'] = "".join(title.xpath('.//div[@class="header"]//div[@class="authors"]/text()').extract()).replace('\n', "")
            items.append(item)
        return items
The syntax looks correct, but it keeps saying INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Any ideas how I messed this up?
Answer (score 0):
The most obvious mistake I can spot is your CrawlSpider.rules definition: it should be a tuple. You'll find the fix in the first answer to this question.
In short, CrawlSpider.rules should be:
rules = (Rule(...), Rule(...),)
Note the parentheses and commas.
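The trailing comma in the answer is not decorative: in Python, parentheses alone do not create a tuple, so a one-element rules definition without the comma is just a parenthesized expression. A minimal sketch (plain strings stand in for Rule objects purely for illustration):

```python
# Parentheses alone do NOT make a tuple; the trailing comma does.
single_rule = ("Rule(...)",)     # a one-element tuple
not_a_tuple = ("Rule(...)")      # just a parenthesized string

print(type(single_rule).__name__)  # tuple
print(type(not_a_tuple).__name__)  # str

# A CrawlSpider iterates over its rules attribute, so a spider with
# a single rule must keep the comma:
#     rules = (Rule(...),)
# Without it, iteration would walk over the characters of a string
# (or whatever single object was parenthesized) instead of the rules.
```

The same pitfall applies to any single-element tuple attribute, which is why the answer stresses the comma rather than the parentheses themselves.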