How can I quote the CSV data produced by my Scrapy script?

Asked: 2019-08-15 17:32:19

Tags: python-3.x csv web-scraping scrapy delimiter

I want my scraped data to be quoted. My Scrapy script (using Scrapy's quotechar parameter) now does this, but it also outputs an error traceback.

I wrote a Scrapy script that scrapes data about live-stream videos from any YouTube channel.

It collects the url, date, title, and channel name.

To run it from the terminal, type:

python3 my_code.py

It saves the data to ./output/youtube-dl_livestream_scraper.csv and writes its log to ./output/youtube-dl_livestream_scraper.log. It works, but not 100%.

The problem was that quotechar was not quoting my data. Following @gangabass's comment, my code now scrapes the data and quotes it, but there is still a traceback error.

My my_code.py:

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exporters import CsvItemExporter
import datetime
import json
import csv

class SpiderYoutubeLivestreamingScraper(scrapy.Spider):
    name = 'SpiderYoutubeLivestreamingScraper'
    start_urls=['https://www.youtube.com/channel/UC2fVSthyWxWSjsiEAHPzriQ/videos',]

    def parse(self, response):
        # The channel page embeds its data as JSON inside a <script> tag:
        #     window["ytInitialData"] = {...};
        # Grab that script, isolate the JSON line, and parse it.
        xpath_path = '//script[contains(., "/watch?v=")]/text()'

        links = response.xpath(xpath_path).extract()
        line_with_json = links[0].splitlines()[1]
        line_with_json = line_with_json.replace('};', '}')
        line_with_json = line_with_json.replace('    window["ytInitialData"] = ', '')

        result_json = json.loads(line_with_json)

        # The grid of videos shown on the channel's "Videos" tab.
        items = (result_json['contents']['twoColumnBrowseResultsRenderer']
                 ['tabs'][1]['tabRenderer']['content']['sectionListRenderer']
                 ['contents'][0]['itemSectionRenderer']['contents'][0]
                 ['gridRenderer']['items'])

        for item in items:
            try:
                video = item['gridVideoRenderer']
                # Only currently live videos carry a "LIVE NOW" badge.
                if video['badges'][0]['metadataBadgeRenderer']['label'] == 'LIVE NOW':
                    yield {
                        'title': video['title']['simpleText'],
                        'url': response.urljoin(
                            video['navigationEndpoint']['commandMetadata']
                            ['webCommandMetadata']['url']),
                        'channel': result_json['header']['c4TabbedHeaderRenderer']['title'],
                        'date': datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S-03:00'),
                    }
            except KeyError:
                # Not a live video (no badge), or an unexpected layout: skip it.
                pass


class QuoteAllDialect(csv.excel):
    # Same as the default Excel dialect, but quote every field.
    quoting = csv.QUOTE_ALL


class MyCsvItemExporter(CsvItemExporter):
    # CSV exporter that uses ';' as the delimiter and quotes all fields.

    def __init__(self, *args, **kwargs):
        kwargs['encoding'] = 'utf-8'
        kwargs['delimiter'] = ';'
        kwargs['quotechar'] = '"'
        kwargs['dialect'] = QuoteAllDialect
        super(MyCsvItemExporter, self).__init__(*args, **kwargs)

process = CrawlerProcess({
    'USER_AGENT'  : 'Mozilla/5.0 '
                        + '(Windows NT 10.0; Win64; x64) '
                        + 'Gecko/20100101 '
                        + 'Firefox/61.0',
    'FEED_FORMAT' : 'csv',
    #'FEED_EXPORTERS' : {'csv': 'scrapy.exporters.CsvItemExporter'},
    'FEED_EXPORTERS' : {'csv': 'my_code.MyCsvItemExporter'},
    'FEED_URI'    : './output/youtube-dl_livestream_scraper.csv',
    'LOG_FILE'    : './output/youtube-dl_livestream_scraper.log',
})

process.crawl(SpiderYoutubeLivestreamingScraper)
process.start()
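
For reference, the quoting behaviour itself can be checked outside Scrapy. Here is a minimal sketch using only the standard csv module (the QuoteAllDialect class is the same one defined above):

import csv
import io

class QuoteAllDialect(csv.excel):
    # Excel dialect, but quote every field.
    quoting = csv.QUOTE_ALL

# Write a couple of rows to an in-memory buffer with the same
# delimiter/quotechar that the exporter uses.
buf = io.StringIO()
writer = csv.writer(buf, dialect=QuoteAllDialect, delimiter=';', quotechar='"')
writer.writerow(['title', 'url'])
writer.writerow(['Live Stream', 'https://www.youtube.com/watch?v=jGIdW3sp-NM'])

print(buf.getvalue())
# Expected output:
# "title";"url"
# "Live Stream";"https://www.youtube.com/watch?v=jGIdW3sp-NM"

So the dialect itself behaves as expected; the quoting problem is solved.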

My log ./output/youtube-dl_livestream_scraper.log is:

2019-08-20 17:06:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2019-08-20 17:06:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 17.5.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.1.4, Platform Linux-4.15.0-58-generic-x86_64-with-Ubuntu-18.04-bionic
2019-08-20 17:06:23 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': './output/youtube-dl_livestream_scraper.csv', 'LOG_FILE': './output/youtube-dl_livestream_scraper.log', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/61.0'}
2019-08-20 17:06:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2019-08-20 17:06:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 17.5.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.1.4, Platform Linux-4.15.0-58-generic-x86_64-with-Ubuntu-18.04-bionic
2019-08-20 17:06:23 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': './output/youtube-dl_livestream_scraper.csv', 'LOG_FILE': './output/youtube-dl_livestream_scraper.log', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/61.0'}
2019-08-20 17:06:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-08-20 17:06:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-20 17:06:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-20 17:06:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-20 17:06:23 [scrapy.core.engine] INFO: Spider opened
2019-08-20 17:06:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-20 17:06:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-08-20 17:06:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/channel/UC2fVSthyWxWSjsiEAHPzriQ/videos> (referer: None)
2019-08-20 17:06:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/channel/UC2fVSthyWxWSjsiEAHPzriQ/videos>
{'title': '? Live Stream - Japanese / Lo Fi / Study / Relax Music | 24x7 Stream', 'url': 'https://www.youtube.com/watch?v=jGIdW3sp-NM', 'channel': 'Mr_MoMo Music', 'date': '2019-08-20T17:06:26-03:00'}
2019-08-20 17:06:26 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-20 17:06:26 [scrapy.extensions.feedexport] INFO: Stored csv feed (1 items) in: ./output/youtube-dl_livestream_scraper.csv
2019-08-20 17:06:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 288,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 417888,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 8, 20, 20, 6, 26, 64283),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 8,
 'memusage/max': 46858240,
 'memusage/startup': 46858240,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 8, 20, 20, 6, 23, 378589)}
2019-08-20 17:06:26 [scrapy.core.engine] INFO: Spider closed (finished)
2019-08-20 17:06:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-08-20 17:06:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-20 17:06:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-20 17:06:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-20 17:06:26 [scrapy.core.engine] INFO: Spider opened
2019-08-20 17:06:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-20 17:06:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

My output ./output/youtube-dl_livestream_scraper.csv is now (quoted, as desired):

"title";"url";"channel";"date"
"? Live Stream - Japanese / Lo Fi / Study / Relax Music | 24x7 Stream";"https://www.youtube.com/watch?v=jGIdW3sp-NM";"Mr_MoMo Music";"2019-08-20T17:06:26-03:00"

But my script outputs this traceback:

Traceback (most recent call last):
  File "my_code.py", line 77, in <module>
    process.start()
  File "/usr/lib/python3/dist-packages/scrapy/crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

So how can I deal with this runtime error?

Maybe this error is generated at runtime by the line 'FEED_EXPORTERS' : {'csv': 'my_code.MyCsvItemExporter'}, but I'm not sure.
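
If that guess is right, the mechanism would be: Scrapy resolves the string 'my_code.MyCsvItemExporter' by importing the my_code module, which re-executes the top-level process.crawl()/process.start() calls (and indeed the log above shows the extensions being loaded and the spider being opened a second time). A minimal sketch of the fix I would try, assuming that theory, is to guard the crawl so it only runs when the file is executed directly:

# Hypothetical fix (untested): only start the crawl when the script is
# run directly, not when Scrapy imports my_code to load the exporter.
if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT'  : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'Gecko/20100101 Firefox/61.0',
        'FEED_FORMAT' : 'csv',
        'FEED_EXPORTERS' : {'csv': 'my_code.MyCsvItemExporter'},
        'FEED_URI'    : './output/youtube-dl_livestream_scraper.csv',
        'LOG_FILE'    : './output/youtube-dl_livestream_scraper.log',
    })
    process.crawl(SpiderYoutubeLivestreamingScraper)
    process.start()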

0 Answers:

No answers yet.