I'm trying to download a CSV file using Scrapy 1.3.2 and Python 2.7.13, with no luck so far.
Here is the spider's code:
import scrapy

class FinancialFilesItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

class FinancialsSpider(scrapy.Spider):
    name = "Financials Spider"
    allowed_domains = ["financials.morningstar.com"]

    def __init__(self, url):
        super(FinancialsSpider, self).__init__()
        self.start_urls = url

    def parse(self, response):
        result = FinancialFilesItem()
        result['file_urls'] = [response.url]
        yield result
And here is the main code that invokes the spider:
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scraper.spiders.financialsSpider import FinancialsSpider

def GetFinancials(url):
    settings = Settings()
    settings.set('ITEM_PIPELINES', {'scrapy.pipelines.files.FilesPipeline': 1})
    settings.set('FILES_STORE', 'D:/downloads/')
    process = CrawlerProcess(settings)
    spider = FinancialsSpider
    process.crawl(spider, url=url)
    process.start()

GetFinancials(["http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB"])
Here is the log from running the main code:
2017-02-18 15:22:38 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot)
2017-02-18 15:22:38 [scrapy.utils.log] INFO: Overridden settings: {}
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-02-18 15:22:38 [scrapy.core.engine] INFO: Spider opened
2017-02-18 15:22:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-18 15:22:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-18 15:22:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> (referer: None)
2017-02-18 15:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> (referer: None)
2017-02-18 15:22:40 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> referred in <None>
2017-02-18 15:22:40 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> referred in <None>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 389, in file_downloaded
    self.store.persist_file(path, buf, info)
  File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 54, in persist_file
    with open(absolute_path, 'wb') as f:
IOError: [Errno 22] invalid mode ('wb') or filename: 'D:/full\\01958104292b4813abcda051da56e55e72d22fb9.html?t=FB'
2017-02-18 15:22:40 [scrapy.core.scraper] DEBUG: Scraped from <200 http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB>
{'file_urls': ['http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB'],
'files': []}
2017-02-18 15:22:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-18 15:22:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 555,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5970,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'file_count': 1,
'file_status_count/downloaded': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 18, 14, 22, 40, 160000),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 2, 18, 14, 22, 38, 826000)}
2017-02-18 15:22:40 [scrapy.core.engine] INFO: Spider closed (finished)
Thanks for your answers.
Answer 0 (score: 0)
Have you tried outputting to CSV?
scrapy crawl nameofspider -o file.csv
Answer 1 (score: 0)
It's right there in the log:
IOError: [Errno 22] invalid mode ('wb') or filename: 'D:/full\\01958104292b4813abcda051da56e55e72d22fb9.html?t=FB'
On Windows, change this path:
settings.set('FILES_STORE', 'D:\\downloads')
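Note the traceback also shows that the saved filename ends in .html?t=FB, and ? is not a legal character in a Windows filename, so the persist step can still fail even after fixing FILES_STORE. A possible workaround (a sketch only, not a confirmed fix for this exact Scrapy version) is to derive the extension from the URL's path rather than the full URL, mirroring the SHA1-based naming the pipeline uses, and plug that logic into a FilesPipeline subclass via a file_path override:

```python
# Sketch: build a Windows-safe file name for a downloaded URL.
# Scrapy's FilesPipeline names files after the SHA1 hex digest of the
# URL plus an extension; here the extension is taken from the URL
# *path only*, so the query string ("?t=FB") never reaches the name.
import hashlib
import posixpath
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def safe_file_name(url):
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    # urlparse().path drops the "?t=FB" query string before splitext
    ext = posixpath.splitext(urlparse(url).path)[1]  # -> '.html'
    return digest + ext

# The URL from the log now maps to a 40-char digest plus '.html',
# with no '?' anywhere in the name:
print(safe_file_name('http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB'))
```

In a real project this logic would go inside a file_path override on a FilesPipeline subclass registered in ITEM_PIPELINES; the helper above only illustrates the naming rule, and the function name is mine, not part of Scrapy's API.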