I am trying to scrape all the historical coin data from https://coinmarketcap.com using scrapy. I can extract all the data from the site, but I cannot save all of it: only about 2,000 entries end up in the file, when in reality there should be more than 20,000. I also suspect the code I wrote could be optimized, but I have not managed to do that.
The folder layout is as follows:
Here is the utils.py code:
import pandas as pd
from datetime import date
import re

today = str(date.today()).replace("-", "")

def sub(s):
    s = re.sub(r"\s+", '-', s)
    return s

def url(s):
    s = 'coinmarketcap.com/currencies/'+s+'/historical-data/?start=20130428&end='+today
    return s

def append(s):
    s = 'https://'+s
    return s

def load():
    data = pd.read_csv('coins.csv')
    data.drop('Unnamed: 0', inplace = True, axis = 1)
    data['Coin'] = data['Coin'].apply(lambda x : sub(x))
    data['URL'] = data['Coin'].apply(lambda x : url(x))
    data['start'] = data['URL'].apply(lambda x : append(x))
    return data
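For reference, a minimal sketch of what load() is expected to return, assuming coins.csv has a Coin column with names such as "Bitcoin" (the actual file is only available through the link further down; the printed URL is illustrative):

import utils

data = utils.load()
print(data.loc[0, 'start'])
# e.g. https://coinmarketcap.com/currencies/Bitcoin/historical-data/?start=20130428&end=20210101
# (the end date depends on the day the script is run)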
Here is the hist.py code:
import scrapy
import pandas as pd
import utils

data = utils.load()

class CoinSpider(scrapy.Spider):
    name = 'coinspider'
    allowed_domains = data['URL']
    start_urls = data['start']

    def parse(self, response):
        scraped_info = {
            'title' : response.css('.table tr ::text').extract()
        }
        title = response.css('.table tr ::text').extract()
        data = pd.DataFrame({'Data' : title})
        data.to_csv('historical.csv', sep = ',')
        yield scraped_info
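As a side note, what the '.table tr ::text' selector actually returns can be inspected with scrapy shell before running the full spider; a hypothetical session against one of the generated start URLs:

scrapy shell 'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20200801'
>>> rows = response.css('.table tr ::text').extract()
>>> rows[:7]   # one flat list of text nodes from all table rows, not one item per row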
The spider above is run with the following command:
scrapy runspider hist.py
Here is the link to the csv file: https://drive.google.com/file/d/13UR5TWGEfz124R9yRaYvafbfxGvCZ6vZ/view?usp=sharing
Any help is appreciated!
Answer 0 (score: 1):
The problem is probably that you overwrite the output .csv file for every scraped URL.
Try replacing
data.to_csv('historical.csv', sep = ',')
with
with open('historical.csv', 'a') as f:
    data.to_csv(f, sep = ',', header=False)
Edit: to write each currency's data to its own file, the currency name can be taken from the URL:
curr = response.url.split('/')[4] # get name of current currency
with open('historical'+curr+'.csv', 'a') as f:
    data.to_csv(f, sep = ',', header=False)
This appends the data to the file instead of overwriting it.
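Putting the two snippets together, a minimal sketch of a revised parse method under the same assumptions (per-currency file names as in the edit above; the files are opened in append mode, so leftovers from a previous run should be deleted before re-running):

def parse(self, response):
    title = response.css('.table tr ::text').extract()
    curr = response.url.split('/')[4]  # currency name taken from the URL path
    data = pd.DataFrame({'Data': title})
    # append mode: rows from every response accumulate instead of replacing the file
    with open('historical' + curr + '.csv', 'a') as f:
        data.to_csv(f, sep=',', header=False)
    yield {'title': title}

Since parse already yields each scraped item, another option is to drop the manual CSV writing entirely and let Scrapy's feed export collect everything into one file, e.g. scrapy runspider hist.py -o historical.csv.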