I'm having trouble with the CrawlSpider example in the Scrapy documentation. It seems to crawl just fine, but I can't get it to output to a CSV file (or to anything at all, really).
So, my question is: can I use this:
scrapy crawl dmoz -o items.csv
Or do I have to create an Item Pipeline?
Update, now with code!:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.item import TargetsItem

class MySpider(CrawlSpider):
    name = 'abc'
    allowed_domains = ['ididntuseexample.com']
    start_urls = ['http://www.ididntuseexample.com']

    rules = (
        # Extract links matching 'ididntuseexample.com' and follow them
        # (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = TargetsItem()
        item['title'] = response.xpath('//h2/a/text()').extract()  # this pulled down data in scrapy shell
        item['link'] = response.xpath('//h2/a/@href').extract()  # this pulled down data in scrapy shell
        return item
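(For reference: the TargetsItem imported above is not shown in the question. Judging by the two fields used in parse_item, it presumably looks roughly like this sketch:)

import scrapy

class TargetsItem(scrapy.Item):
    # The two fields populated in parse_item above.
    title = scrapy.Field()
    link = scrapy.Field()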
Answer 0 (score: 2)
Rules are the mechanism CrawlSpider uses to follow links. Those links are defined with a LinkExtractor. This element basically indicates which links to extract from the crawled pages (like the ones defined in the start_urls list). Then you can pass a callback that will be called on each extracted link, or, more precisely, on the pages downloaded by following those links.
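As a minimal sketch of the two common Rule shapes (the spider name, domain, and URL patterns here are invented for illustration):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # Without a callback, matched links are only followed (follow=True by default).
        Rule(LinkExtractor(allow=('category/', ))),
        # With a callback, each page downloaded from a matched link is passed to parse_item.
        Rule(LinkExtractor(allow=('product/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Scraping %s' % response.url)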
Your rule has to call parse_item. So, replace:
Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
with:
Rule(LinkExtractor(allow=('ididntuseexample.com',)), callback='parse_item'),
This rule defines that you want to call parse_item on every link whose href matches ididntuseexample.com. I suspect that what you want in the link extractor is not the domain, but the URL patterns of the links you want to follow/scrape.
Here is a basic example that crawls Hacker News to retrieve the title and the first words of the first comment for every news item on the main page.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()
    comment = scrapy.Field()

class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]

    rules = (
        # Follow any item link and call parse_item.
        Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = HackerNewsItem()
        # Get the title
        item['title'] = response.xpath('//*[contains(@class, "title")]/a/text()').extract()
        # Get the first words of the first comment
        item['comment'] = response.xpath('(//*[contains(@class, "comment")])[1]/font/text()').extract()
        return item