Can someone explain how to export the data scraped by this script to a CSV from within the Python script itself? Judging by the output I see, I have successfully scraped the data, but I'm not sure how to get it into a CSV cleanly. Thanks.
import scrapy
import scrapy.crawler as crawler

class RedditbotSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['www.reddit.com/r/gameofthrones/']
    start_urls = ['https://www.reddit.com/r/gameofthrones/']
    output = 'output.csv'

    def parse(self, response):
        yield {'a': 'b'}
        # Extracting the content using css selectors
        titles = response.css('.title.may-blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()
        # Give the extracted content row wise
        for item in zip(titles, votes, times, comments):
            # create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'vote': item[1],
                'created_at': item[2],
                'comments': item[3],
            }
            # yield or give the scraped info to scrapy
            yield scraped_info

def run_crawler(spider_cls):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = crawler.CrawlerRunner()
    return runner.crawl(spider_cls)  # return Deferred

def test_scrapy_crawler():
    deferred = run_crawler(RedditbotSpider)

    @deferred.addCallback
    def success(results):
        """
        After the crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def error(failure):
        raise failure.value

    return deferred

test_scrapy_crawler()
Answer 0 (score: 4)
You can include the feed exporter configuration in the settings before running the spider. So, in your code, try changing:
runner = crawler.CrawlerRunner()
to:
runner = crawler.CrawlerRunner({
    'FEED_URI': 'output_file.csv',
    'FEED_FORMAT': 'csv',
})
The exported items should end up in an output_file.csv file in the same directory from which you run the script.
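As a side note, newer Scrapy releases (2.1+) deprecate FEED_URI/FEED_FORMAT in favor of a single FEEDS dict, and CrawlerRunner never starts the Twisted reactor by itself, so a bare script can exit before the crawl actually runs. Below is a minimal, self-contained sketch of the same idea using CrawlerProcess, which manages the reactor for you; the selectors and field names are copied from the question, while the FEEDS form assumes you are on a recent Scrapy version:

import scrapy
from scrapy.crawler import CrawlerProcess

class RedditbotSpider(scrapy.Spider):
    name = 'redditbot'
    start_urls = ['https://www.reddit.com/r/gameofthrones/']

    def parse(self, response):
        # Same extraction logic as the question's spider.
        titles = response.css('.title.may-blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()
        for title, vote, created_at, comment_count in zip(titles, votes, times, comments):
            yield {
                'title': title,
                'vote': vote,
                'created_at': created_at,
                'comments': comment_count,
            }

# CrawlerProcess starts and stops the Twisted reactor itself,
# so the script runs the crawl to completion on its own.
process = CrawlerProcess(settings={
    # Scrapy >= 2.1: the FEEDS dict replaces FEED_URI/FEED_FORMAT
    # (assumption; on older versions use the two flat keys instead).
    'FEEDS': {
        'output_file.csv': {'format': 'csv'},
    },
})
process.crawl(RedditbotSpider)
process.start()  # blocks until the crawl finishes

If you want to keep the CrawlerRunner/Deferred structure from the question (for example, inside a test suite), you can pass the same settings dict to CrawlerRunner, but you are then responsible for starting twisted.internet.reactor yourself and stopping it from a callback on the returned Deferred.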