I want my scraped data to be quoted, but my Scrapy script (using Scrapy's quotechar parameter) does not manage this cleanly and prints an error traceback.
I wrote a Scrapy script that scrapes data about livestream videos from any YouTube channel. It collects the url, date, title and channel name.
To run it from a terminal, type:
python3 my_code.py
It saves the data to ./output/youtube-dl_livestream_scraper.csv and writes its log to ./output/youtube-dl_livestream_scraper.log. This works, but not 100%.
The problem: after @gangabass's comment my code now scrapes the data with quoting, but a traceback error remains. Before that change, quotechar did not quote my data.
My my_code.py:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.response import open_in_browser
from scrapy.exporters import CsvItemExporter
from pprint import pprint
import datetime
import json
import csv


class SpiderYoutubeLivestreamingScraper(scrapy.Spider):
    name = 'SpiderYoutubeLivestreamingScraper'
    start_urls = ['https://www.youtube.com/channel/UC2fVSthyWxWSjsiEAHPzriQ/videos', ]

    def parse(self, response):
        # print('parse starts')
        xpath_path = '//script[contains(., "/watch?v=")]/text()'
        links = response.xpath(xpath_path).extract()
        line_with_json = links[0]
        line_with_json = line_with_json.splitlines()
        line_with_json = line_with_json[1]
        line_with_json = line_with_json.replace('};', '}')
        line_with_json = line_with_json.replace(' window["ytInitialData"] = ', '')
        result_json = json.loads(line_with_json)
        total_videos = len(result_json['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['sectionListRenderer']['contents'][0]['itemSectionRenderer']['contents'][0]['gridRenderer']['items'])
        for i in range(0, total_videos):
            try:
                exist_live = result_json['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['sectionListRenderer']['contents'][0]['itemSectionRenderer']['contents'][0]['gridRenderer']['items'][i]['gridVideoRenderer']['badges'][0]['metadataBadgeRenderer']['label']
                if exist_live == 'LIVE NOW':
                    yield {
                        'title': result_json['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['sectionListRenderer']['contents'][0]['itemSectionRenderer']['contents'][0]['gridRenderer']['items'][i]['gridVideoRenderer']['title']['simpleText'],
                        'url': response.urljoin(result_json['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['sectionListRenderer']['contents'][0]['itemSectionRenderer']['contents'][0]['gridRenderer']['items'][i]['gridVideoRenderer']['navigationEndpoint']['commandMetadata']['webCommandMetadata']['url']),
                        'channel': result_json['header']['c4TabbedHeaderRenderer']['title'],
                        'date': datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S-03:00'),
                    }
            except KeyError:
                pass
        # print('parse ends')


class QuoteAllDialect(csv.excel):
    quoting = csv.QUOTE_ALL


class MyCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['encoding'] = 'utf-8'
        kwargs['delimiter'] = ';'
        kwargs['quotechar'] = '"'
        # kwargs.update({'dialect': QuoteAllDialect})
        kwargs['dialect'] = QuoteAllDialect
        super(MyCsvItemExporter, self).__init__(*args, **kwargs)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 '
                  + '(Windows NT 10.0; Win64; x64) '
                  + 'Gecko/20100101 '
                  + 'Firefox/61.0',
    'FEED_FORMAT': 'csv',
    # 'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
    'FEED_EXPORTERS': {'csv': 'my_code.MyCsvItemExporter'},
    'FEED_URI': './output/youtube-dl_livestream_scraper.csv',
    'LOG_FILE': './output/youtube-dl_livestream_scraper.log',
})
process.crawl(SpiderYoutubeLivestreamingScraper)
process.start()
My log ./output/youtube-dl_livestream_scraper.log is:
2019-08-20 17:06:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2019-08-20 17:06:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 17.5.0 (OpenSSL 1.1.1 11 Sep 2018), cryptography 2.1.4, Platform Linux-4.15.0-58-generic-x86_64-with-Ubuntu-18.04-bionic
2019-08-20 17:06:23 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': './output/youtube-dl_livestream_scraper.csv', 'LOG_FILE': './output/youtube-dl_livestream_scraper.log', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/61.0'}
2019-08-20 17:06:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2019-08-20 17:06:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]], pyOpenSSL 17.5.0 (OpenSSL 1.1.1 11 Sep 2018), cryptography 2.1.4, Platform Linux-4.15.0-58-generic-x86_64-with-Ubuntu-18.04-bionic
2019-08-20 17:06:23 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': './output/youtube-dl_livestream_scraper.csv', 'LOG_FILE': './output/youtube-dl_livestream_scraper.log', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/61.0'}
2019-08-20 17:06:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-08-20 17:06:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-20 17:06:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-20 17:06:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-20 17:06:23 [scrapy.core.engine] INFO: Spider opened
2019-08-20 17:06:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-20 17:06:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-08-20 17:06:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/channel/UC2fVSthyWxWSjsiEAHPzriQ/videos> (referer: None)
2019-08-20 17:06:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/channel/UC2fVSthyWxWSjsiEAHPzriQ/videos>
{'title': '? Live Stream - Japanese / Lo Fi / Study / Relax Music | 24x7 Stream', 'url': 'https://www.youtube.com/watch?v=jGIdW3sp-NM', 'channel': 'Mr_MoMo Music', 'date': '2019-08-20T17:06:26-03:00'}
2019-08-20 17:06:26 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-20 17:06:26 [scrapy.extensions.feedexport] INFO: Stored csv feed (1 items) in: ./output/youtube-dl_livestream_scraper.csv
2019-08-20 17:06:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 288,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 417888,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 8, 20, 20, 6, 26, 64283),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 8,
'memusage/max': 46858240,
'memusage/startup': 46858240,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 8, 20, 20, 6, 23, 378589)}
2019-08-20 17:06:26 [scrapy.core.engine] INFO: Spider closed (finished)
2019-08-20 17:06:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-08-20 17:06:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-20 17:06:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-20 17:06:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-20 17:06:26 [scrapy.core.engine] INFO: Spider opened
2019-08-20 17:06:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-20 17:06:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
My output ./output/youtube-dl_livestream_scraper.csv is now (quoted, as desired):
"title";"url";"channel";"date"
"? Live Stream - Japanese / Lo Fi / Study / Relax Music | 24x7 Stream";"https://www.youtube.com/watch?v=jGIdW3sp-NM";"Mr_MoMo Music";"2019-08-20T17:06:26-03:00"
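For what it's worth, the quoting configuration itself appears correct: the same dialect settings, reproduced with the plain csv module and without any Scrapy involvement, yield exactly the header line above (a minimal sketch mirroring the kwargs passed to MyCsvItemExporter):

```python
import csv
import io

# Same dialect as in my_code.py: excel defaults, but quote every field.
class QuoteAllDialect(csv.excel):
    quoting = csv.QUOTE_ALL

buf = io.StringIO()
# delimiter/quotechar are applied on top of the dialect, as in the exporter.
writer = csv.writer(buf, dialect=QuoteAllDialect, delimiter=';', quotechar='"')
writer.writerow(['title', 'url', 'channel', 'date'])
print(buf.getvalue())  # "title";"url";"channel";"date"
```

So the quoting part of the question is solved; only the traceback below remains.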
But my script prints a traceback error:
Traceback (most recent call last):
File "my_code.py", line 77, in <module>
process.start()
File "/usr/lib/python3/dist-packages/scrapy/crawler.py", line 291, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1242, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1222, in startRunning
ReactorBase.startRunning(self)
File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 730, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
So how can I handle this runtime error? Maybe the error is generated at runtime by this line: 'FEED_EXPORTERS' : {'csv': 'my_code.MyCsvItemExporter'}, but I'm not sure.
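A note on why that line is suspect (my diagnosis, not confirmed): to load the exporter, Scrapy resolves the string 'my_code.MyCsvItemExporter' by importing my_code as a module. Because process = CrawlerProcess(...), process.crawl(...) and process.start() sit at module level, that import re-executes them while the reactor is already running, which is exactly what raises ReactorNotRestartable (and would also explain the duplicated "Scrapy 1.5.0 started" lines in the log). The mechanism can be sketched without Scrapy, with a list standing in for the one-shot Twisted reactor:

```python
# Stand-in for the tail of my_code.py: `starts` plays the role of the
# Twisted reactor, which may only be started once per process.
UNGUARDED_TAIL = 'starts.append("reactor.start()")'

GUARDED_TAIL = (
    'if __name__ == "__main__":\n'
    '    starts.append("reactor.start()")'
)

def run(source, starts, as_main):
    """Execute `source` as if it were my_code.py, either run directly
    (__name__ == '__main__') or imported by Scrapy (__name__ == 'my_code')."""
    exec(source, {'starts': starts,
                  '__name__': '__main__' if as_main else 'my_code'})

# Unguarded: the direct run starts the reactor, then Scrapy's import of
# my_code (to load MyCsvItemExporter) starts it a second time -> crash.
starts = []
run(UNGUARDED_TAIL, starts, as_main=True)
run(UNGUARDED_TAIL, starts, as_main=False)
print(len(starts))  # 2 -> the second start raises ReactorNotRestartable

# Guarded: the import no longer re-runs process.start().
starts = []
run(GUARDED_TAIL, starts, as_main=True)
run(GUARDED_TAIL, starts, as_main=False)
print(len(starts))  # 1 -> only the direct run starts the reactor
```

If this is really the trigger, the equivalent change in my_code.py would be wrapping the CrawlerProcess construction, process.crawl(...) and process.start() in an if __name__ == '__main__': block.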