Scrapy - unable to store scraped values in a file

Date: 2014-06-09 22:58:49

Tags: python-2.7 scrapy

I'm trying to crawl the web to find blogs with Polish or Poland in their titles. I'm stuck at the very first step: my spider manages to grab my site's title, but doesn't store it in a file when I run

scrapy crawl spider -o test.csv -t csv blogseek

Here is my setup. The spider:

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from polishblog.items import PolishblogItem

class BlogseekSpider(CrawlSpider):
    name = 'blogseek'
    start_urls = [
          #'http://www.thepolskiblog.co.uk',
          #'http://blogs.transparent.com/polish',
          #'http://poland.leonkonieczny.com/blog/',
          #'http://www.imaginepoland.blogspot.com' 
          'http://www.normalesup.org/~dthiriet'
          ]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),

    )

    def parse_item(self, response):
        sel = Selector(response)
        i = PolishblogItem()
        i['titre'] = sel.xpath('//title/text()').extract()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
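
Side note: if I understand the CrawlSpider docs correctly, the allow=r'Items/' pattern restricts which followed links are handed to parse_item, and the start URL itself never goes through that callback. Just to illustrate how I read the mechanism (this is only a sketch, not what I'm actually running), a rule with no filter would be:

# Sketch only: with no allow pattern the extractor follows every link on the
# page and each crawled response is passed to parse_item.
rules = (
    Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
)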

items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class PolishblogItem(Item):
    # define the fields for your item here like:
    titre = Field()
    #description = Field()
    #url = Field()
    #pass

When I run

scrapy parse --spider=blogseek -c parse_item -d 2 'http://www.normalesup.org/~dthiriet'

I do get the title. So what's going on? I bet it's something silly, but I can't find the problem. Thanks!

EDIT: maybe it's a feed export issue. When I run with this settings.py:

# Scrapy settings for polishblog project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'polishblog'

SPIDER_MODULES = ['polishblog.spiders']
NEWSPIDER_MODULE = 'polishblog.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'damien thiriet (+http://www.normalesup.org/~dthiriet)'

COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_DELAY=0.25
ROBOTSTXT_OBEY=True
DEPTH_LIMIT=3

# storage of the results
FEED_EXPORTERS='CsvItemExporter'
FEED_URI='titresblogs.csv'
FEED_FORMAT='csv'

I get the error message

File "/usr/lib/python2.7/site-packages/scrapy/contrib/feedexport.py", line 196, in _load_components
conf.update(self.settings[setting_prefix])
ValueError: dictionary update sequence element #0 has length 1; 2 is required
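
If I read the feed export docs correctly, FEED_EXPORTERS is expected to be a dict mapping a feed format name to an exporter class path, not a plain string, so the string above is probably what trips the dictionary update. A sketch of what I think the setting expects (the class path is my guess for this Scrapy version, and the built-in csv format should already use it anyway):

# Sketch only: FEED_EXPORTERS maps a format name to an exporter class path.
# The stock 'csv' format already uses CsvItemExporter, so an entry like this
# would only be needed for a custom exporter.
FEED_EXPORTERS = {
    'csv': 'scrapy.contrib.exporter.CsvItemExporter',
}
FEED_URI = 'titresblogs.csv'
FEED_FORMAT = 'csv'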

I installed Scrapy this way

pip2.7 install Scrapy

Did I get that wrong? The docs recommend pip install Scrapy, but that would have pulled in the Python 3.4 dependencies, and I bet that's not the point anyway.

EDIT #2:

Here is my log

2014-06-10 11:00:15+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: polishblog)
2014-06-10 11:00:15+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-06-10 11:00:15+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'polishblog.spiders', 'FEED_URI': 'stdout:', 'DEPTH_LIMIT': 3, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['polishblog.spiders'], 'BOT_NAME': 'polishblog', 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'USER_AGENT': 'damien thiriet (+http://www.normalesup.org/~dthiriet)', 'LOG_FILE': '/tmp/scrapylog', 'DOWNLOAD_DELAY': 0.25}
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled item pipelines: 
2014-06-10 11:00:15+0200 [blogseek] INFO: Spider opened
2014-06-10 11:00:15+0200 [blogseek] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/robots.txt> (referer: None)
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Redirecting (301) to <GET http://www.normalesup.org/~dthiriet/> from <GET http://www.normalesup.org/~dthiriet>
2014-06-10 11:00:16+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/~dthiriet/> (referer: None)
2014-06-10 11:00:16+0200 [blogseek] INFO: Closing spider (finished)
2014-06-10 11:00:16+0200 [blogseek] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 737,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 6187,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 2,
     'downloader/response_status_count/301': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 6, 10, 9, 0, 16, 166865),
     'log_count/DEBUG': 5,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 6, 10, 9, 0, 15, 334634)}
2014-06-10 11:00:16+0200 [blogseek] INFO: Spider closed (finished)

0 Answers:

There are no answers.