I'm trying to crawl the web to find blogs with "polish" or "Poland" in their title. I'm running into a problem right at the start: my spider grabs my site's title, but doesn't store it in a file when I run
scrapy crawl spider -o test.csv -t csv blogseek
Here is my setup. The spider:
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from polishblog.items import PolishblogItem


class BlogseekSpider(CrawlSpider):
    name = 'blogseek'
    start_urls = [
        #'http://www.thepolskiblog.co.uk',
        #'http://blogs.transparent.com/polish',
        #'http://poland.leonkonieczny.com/blog/',
        #'http://www.imaginepoland.blogspot.com'
        'http://www.normalesup.org/~dthiriet'
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = PolishblogItem()
        i['titre'] = sel.xpath('//title/text()').extract()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
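In case the rule matters: as far as I understand, allow=r'Items/' is just the default left by the genspider template, so the link extractor only follows URLs matching Items/. If that regex is the culprit, I guess a catch-all rule would look roughly like this (untested sketch, same identifiers as above):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

# no allow filter: follow every link on the page and pass each response to parse_item
rules = (
    Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
)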
items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class PolishblogItem(Item):
    # define the fields for your item here like:
    titre = Field()
    #description = Field()
    #url = Field()
    #pass
When I run
scrapy parse --spider=blogseek -c parse_item -d 2 'http://www.normalesup.org/~dthiriet'
I do get the title. So what's the deal? I bet it's something silly, but I can't spot the problem. Thanks!
Edit: it may be a feed export problem. When I run with this settings.py:
# Scrapy settings for polishblog project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'polishblog'
SPIDER_MODULES = ['polishblog.spiders']
NEWSPIDER_MODULE = 'polishblog.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'damien thiriet (+http://www.normalesup.org/~dthiriet)'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_DELAY=0.25
ROBOTSTXT_OBEY=True
DEPTH_LIMIT=3
# storing the results
FEED_EXPORTERS='CsvItemExporter'
FEED_URI='titresblogs.csv'
FEED_FORMAT='csv'
I get the error message:
File "/usr/lib/python2.7/site-packages/scrapy/contrib/feedexport.py", line 196, in _load_components
conf.update(self.settings[setting_prefix])
ValueError: dictionary update sequence element #0 has length 1; 2 is required
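Reading feedexport.py, it looks like FEED_EXPORTERS is supposed to be a dict mapping a format name to an exporter class, not a plain string, which would explain the "dictionary update sequence" error. My guess (not verified) is that the setting should either be dropped entirely, since csv is a built-in format, or written as a dict along these lines:

# settings.py -- sketch, assuming the stock CSV exporter is all I need
FEED_EXPORTERS = {
    'csv': 'scrapy.contrib.exporter.CsvItemExporter',
}
FEED_URI = 'titresblogs.csv'
FEED_FORMAT = 'csv'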
I installed scrapy this way:
pip2.7 install Scrapy
Did I get that wrong? The docs recommend pip install Scrapy, but that would pull in the Python 3.4 dependencies, and I bet that's not the issue anyway.
Edit #2:
Here is my log:
2014-06-10 11:00:15+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: polishblog)
2014-06-10 11:00:15+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-06-10 11:00:15+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'polishblog.spiders', 'FEED_URI': 'stdout:', 'DEPTH_LIMIT': 3, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['polishblog.spiders'], 'BOT_NAME': 'polishblog', 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'USER_AGENT': 'damien thiriet (+http://www.normalesup.org/~dthiriet)', 'LOG_FILE': '/tmp/scrapylog', 'DOWNLOAD_DELAY': 0.25}
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled item pipelines:
2014-06-10 11:00:15+0200 [blogseek] INFO: Spider opened
2014-06-10 11:00:15+0200 [blogseek] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/robots.txt> (referer: None)
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Redirecting (301) to <GET http://www.normalesup.org/~dthiriet/> from <GET http://www.normalesup.org/~dthiriet>
2014-06-10 11:00:16+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/~dthiriet/> (referer: None)
2014-06-10 11:00:16+0200 [blogseek] INFO: Closing spider (finished)
2014-06-10 11:00:16+0200 [blogseek] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6187,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 10, 9, 0, 16, 166865),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 6, 10, 9, 0, 15, 334634)}
2014-06-10 11:00:16+0200 [blogseek] INFO: Spider closed (finished)