My problem is as follows: scrapy exports an empty csv.
My code is structured like this:
items.py:
import scrapy

class BomnegocioItem(scrapy.Item):
    title = scrapy.Field()
    pass
pipelines.py:
class BomnegocioPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py:
BOT_NAME = 'bomnegocio'
SPIDER_MODULES = ['bomnegocio.spiders']
NEWSPIDER_MODULE = 'bomnegocio.spiders'
LOG_ENABLED = True
bomnegocioSpider.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from bomnegocio.items import BomnegocioItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log
import csv
import urllib2
class bomnegocioSpider(CrawlSpider):
    name = 'bomnegocio'
    allowed_domains = ["http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"]
    start_urls = [
        "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/fogao'), callback="parse_bomnegocio", follow=True),
    )

    print "=====> Start data extract ...."

    def parse_bomnegocio(self, response):
        #hxs = HtmlXPathSelector(response)
        #items = []
        item = BomnegocioItem()
        item['title'] = response.xpath("//*[@id='ad_title']/text()").extract()[0]
        #items.append(item)
        return item

    print "=====> Finish data extract."
#//*[@id="ad_title"]
Terminal:
$ scrapy crawl bomnegocio -o dataextract.csv -t csv
=====> Start data extract ....
=====> Finish data extract.
2014-12-12 13:38:45-0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: bomnegocio)
2014-12-12 13:38:45-0200 [scrapy] INFO: Optional features available: ssl, http11
2014-12-12 13:38:45-0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bomnegocio.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['bomnegocio.spiders'], 'FEED_URI': 'dataextract.csv', 'BOT_NAME': 'bomnegocio'}
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled item pipelines:
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider opened
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Crawled (200) <GET http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713> (referer: None)
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?t=&u=http%3A%2F%2Fsp.bomnegocio.com%2Fregiao-de-bauru-e-marilia%2Feletrodomesticos%2Ffogao-industrial-itajobi-4-bocas-c-forno-54183713>
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Closing spider (finished)
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 308,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 8503,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 538024),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 119067)}
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider closed (finished)
Why?

===> 2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
$ nano dataextract.csv
It looks empty. =(
I tested a couple of hypotheses:
i) Is my scraping code using the wrong XPath? I went to the terminal and typed:
$ scrapy shell "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
>>> response.xpath("//*[@id='ad_title']/text()").extract()[0]
u'\n\t\t\t\n\t\t\t\tFog\xe3o industrial itajobi 4 bocas c/ forno \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t- '
Answer: no, the problem is not in the XPath expression.
ii) My imports? The log output shows no import problems.
Thanks for your attention; I look forward to hearing your thoughts.
Answer (score: 0):
There are a few problems with this spider:
1) allowed_domains is meant to hold domain names, not full URLs, so you want:
allowed_domains = ["bomnegocio.com"]
(With a full URL in there, the OffsiteMiddleware cannot match the domain of outgoing requests, so they get filtered as offsite.)
2) The use of rules is not appropriate here, because rules define how to crawl a site, that is, which links to follow, and a rule's callback only runs on pages reached by following those links, never on the start_urls responses themselves. In this case you don't need to follow any links; you just want to scrape data directly from the URLs listed in start_urls. So I suggest you remove the rules attribute, make the spider extend scrapy.Spider, and scrape the data in the default parse callback:
from bomnegocio.items import BomnegocioItem
import scrapy

class bomnegocioSpider(scrapy.Spider):
    name = 'bomnegocio'
    allowed_domains = ["bomnegocio.com"]
    start_urls = [
        "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    def parse(self, response):
        print "=====> Start data extract ...."
        yield BomnegocioItem(
            title=response.xpath("//*[@id='ad_title']/text()").extract()[0]
        )
        print "=====> Finish data extract."
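If you also want the title without the surrounding '\n\t' padding seen in the shell session above, one option (a sketch, not strictly needed for the fix) is to let XPath normalize the whitespace:

# alternative extraction inside parse(): trims and collapses whitespace
title = response.xpath("normalize-space(//*[@id='ad_title'])").extract()[0]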
Also note how the print statements are now inside the callback, and that it uses yield instead of return, which lets you generate multiple items from one page.
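For instance, if a single response contained several ads, the same callback could yield one item per match. A minimal sketch, assuming a hypothetical listing page where each title sits in an h2 element with class "ad_title":

def parse(self, response):
    # Hypothetical listing page: yield one item per ad title found
    for title in response.xpath("//h2[@class='ad_title']/text()").extract():
        yield BomnegocioItem(title=title.strip())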