Scrapy exports an empty CSV

Date: 2014-12-12 15:59:21

Tags: python, csv, xpath, scrapy

My problem: Scrapy exports an empty CSV.

My code is structured as follows:

items.py:

import scrapy


class BomnegocioItem(scrapy.Item):
    title = scrapy.Field()

pipelines.py:

class BomnegocioPipeline(object):
    def process_item(self, item, spider):
        return item

settings.py:

BOT_NAME = 'bomnegocio'

SPIDER_MODULES = ['bomnegocio.spiders']
NEWSPIDER_MODULE = 'bomnegocio.spiders'
LOG_ENABLED = True

bomnegocioSpider.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from bomnegocio.items  import BomnegocioItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log
import csv
import urllib2

class bomnegocioSpider(CrawlSpider):

    name = 'bomnegocio'
    allowed_domains = ["http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"]
    start_urls = [
    "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'/fogao'), callback="parse_bomnegocio", follow=True),
    )

    print "=====> Start data extract ...."

    def parse_bomnegocio(self,response):                                                     
        #hxs = HtmlXPathSelector(response)

        #items = [] 
        item = BomnegocioItem()     

        item['title'] = response.xpath("//*[@id='ad_title']/text()").extract()[0]                        
        #items.append(item)

        return item

    print "=====> Finish data extract."     

    #//*[@id="ad_title"]

Terminal:

$ scrapy crawl bomnegocio -o dataextract.csv -t csv

=====> Start data extract ....
=====> Finish data extract.
2014-12-12 13:38:45-0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: bomnegocio)
2014-12-12 13:38:45-0200 [scrapy] INFO: Optional features available: ssl, http11
2014-12-12 13:38:45-0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bomnegocio.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['bomnegocio.spiders'], 'FEED_URI': 'dataextract.csv', 'BOT_NAME': 'bomnegocio'}
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled item pipelines: 
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider opened
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Crawled (200) <GET http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713> (referer: None)
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?t=&u=http%3A%2F%2Fsp.bomnegocio.com%2Fregiao-de-bauru-e-marilia%2Feletrodomesticos%2Ffogao-industrial-itajobi-4-bocas-c-forno-54183713>
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Closing spider (finished)
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 308,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 8503,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 538024),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'offsite/domains': 1,
     'offsite/filtered': 1,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 119067)}
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider closed (finished)

Why this?

===> 2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

$ nano dataextract.csv

It looks empty. =(

I made some hypotheses:

i) Is my XPath expression wrong? I went to the terminal and typed:

$ scrapy shell "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
>>> response.xpath("//*[@id='ad_title']/text()").extract()[0]
u'\n\t\t\t\n\t\t\t\tFog\xe3o industrial itajobi 4 bocas c/ forno \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t- '

Answer: no, the problem is not in the XPath expression.
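
Incidentally, the shell output above shows the title comes back padded with newlines and tabs. A small cleanup sketch based on the value shown (this is optional tidying, not the cause of the empty CSV):

>>> title = response.xpath("//*[@id='ad_title']/text()").extract()[0]
>>> u" ".join(title.split())  # collapse the runs of whitespace
u'Fog\xe3o industrial itajobi 4 bocas c/ forno -'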

ii) My imports? The log doesn't show any import problems.

Thanks for your attention; I look forward to hearing your opinions.

1 answer:

Answer 0 (score: 0)

There are a few problems with this spider:

1) allowed_domains is meant to hold domain names, not URLs, so you want:

allowed_domains = ["bomnegocio.com"]
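
Why this matters: the host of each extracted link is compared against the entries in allowed_domains, and a full URL there never matches a hostname, so followed requests get filtered as offsite. A simplified sketch of that kind of check, in the Python 2 used elsewhere in this question (illustrative only, not Scrapy's actual implementation):

from urlparse import urlparse  # Python 2 stdlib

# A host is allowed if it equals an allowed domain or is a subdomain of one.
def is_offsite(url, allowed_domains):
    host = urlparse(url).netloc
    return not any(host == d or host.endswith("." + d)
                   for d in allowed_domains)

url = "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
print is_offsite(url, ["bomnegocio.com"])  # False: request allowed
print is_offsite(url, [url])               # True: a full URL never matches the host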

2) The use of rules here is not really appropriate, since rules define how the site is crawled, i.e. which links to follow. In this case you don't need to follow any links; you just want to scrape data directly from the URLs listed in start_urls. So I suggest you remove the rules attribute, make the spider extend scrapy.Spider, and scrape the data in the default callback parse:

from bomnegocio.items import BomnegocioItem
import scrapy

class bomnegocioSpider(scrapy.Spider):

    name = 'bomnegocio'
    allowed_domains = ["bomnegocio.com"]
    start_urls = [
    "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    def parse(self,response):
        print "=====> Start data extract ...."
        yield BomnegocioItem(
            title=response.xpath("//*[@id='ad_title']/text()").extract()[0]
        )
        print "=====> Finish data extract."

Also note how the print statements are now inside the callback, and that it uses yield instead of return (which lets you produce multiple items from a single page).
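
If a page contained several matching nodes, the same pattern would let the callback emit one item per match. A hypothetical sketch (the XPath below is invented for illustration, not taken from the actual page):

def parse(self, response):
    # Illustrative only: yield one item per matched title node.
    for title in response.xpath("//h2[@class='ad_title']/text()").extract():
        yield BomnegocioItem(title=title.strip())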