How to iterate over multiple URLs in Scrapy and save the data after each iteration

Date: 2018-03-10 20:45:20

Tags: python pandas web-scraping scrapy scrapy-spider

I am trying to scrape all of the historical coin data from https://coinmarketcap.com using Scrapy. I can scrape everything from the site, but I cannot save all of it: only about 2,000 entries end up in the output file, while there should be more than 20,000. I also think the code I have written could be optimized, but I have not managed to do so.

The folder layout is:

  • hist.py
  • utils.py
  • coins.csv

This is the utils.py code:

import pandas as pd
from datetime import date
import re

# Today's date as YYYYMMDD, used as the end of the historical-data date range
today = str(date.today()).replace("-","")

def sub(s):
    # Replace whitespace in a coin name with dashes
    s = re.sub(r"\s+", '-', s)
    return s

def url(s):
    # Build the historical-data path for a coin
    s = 'coinmarketcap.com/currencies/'+s+'/historical-data/?start=20130428&end='+today
    return s

def append(s):
    # Prefix the path with the https scheme
    s = 'https://'+s
    return s

def load():
    # Read the coin list and build one start URL per coin
    data = pd.read_csv('coins.csv')
    data.drop('Unnamed: 0', inplace = True, axis = 1)
    data['Coin'] = data['Coin'].apply(lambda x : sub(x))
    data['URL'] = data['Coin'].apply(lambda x : url(x))
    data['start'] = data['URL'].apply(lambda x : append(x))
    return data
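For reference, here is a minimal sketch of how these helpers are expected to behave; the coin name "Bitcoin Cash" is a hypothetical example and is not taken from coins.csv:

import utils

# Hypothetical coin name: whitespace becomes a dash, then the
# historical-data URL is built and prefixed with the scheme.
name = utils.sub("Bitcoin Cash")      # -> "Bitcoin-Cash"
print(utils.append(utils.url(name)))
# -> https://coinmarketcap.com/currencies/Bitcoin-Cash/historical-data/?start=20130428&end=<today>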

This is the hist.py code:

import scrapy
import pandas as pd
import utils

data = utils.load()

class CoinSpider(scrapy.Spider):
    name = 'coinspider'
    allowed_domains = data['URL']
    start_urls = data['start']

    def parse(self, response):
        # Extract all text from the rows of the historical-data table
        scraped_info = {
            'title' : response.css('.table tr ::text').extract()
            }

        title = response.css('.table tr ::text').extract()

        # Write the table text to a CSV (the same path is used for every response)
        data = pd.DataFrame({'Data' : title})
        data.to_csv('historical.csv', sep = ',')
        yield scraped_info

The spider above is run with the following command:

scrapy runspider hist.py

Here is the link to the csv file: https://drive.google.com/file/d/13UR5TWGEfz124R9yRaYvafbfxGvCZ6vZ/view?usp=sharing

Any help is appreciated!

1 Answer:

Answer 0 (score: 1)

The problem is probably that you are overwriting the output .csv file for every URL you scrape.

Try replacing

data.to_csv('historical.csv', sep = ',')

with

with open('historical.csv', 'a') as f:
    data.to_csv(f, sep = ',', header=False)

EDIT:

curr = response.url.split('/')[4] # get name of current currency
with open('historical'+curr+'.csv', 'a') as f:
    data.to_csv(f, sep = ',', header=False)

This appends the data to the file (one file per currency) instead of overwriting it.
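Putting the original parse method together with the edit above, a minimal sketch of the corrected callback could look like this; only the file handling changes, the selectors are the ones from the question:

    def parse(self, response):
        # Extract all text from the rows of the historical-data table
        title = response.css('.table tr ::text').extract()
        scraped_info = {'title': title}

        data = pd.DataFrame({'Data': title})

        curr = response.url.split('/')[4]  # name of the current currency, taken from the URL
        # Append mode: each response adds rows instead of overwriting the file
        with open('historical' + curr + '.csv', 'a') as f:
            data.to_csv(f, sep=',', header=False)

        yield scraped_info

Since parse already yields scraped_info, another option would be to let Scrapy's feed export collect the items instead of writing the CSV manually, e.g. scrapy runspider hist.py -o historical.csv.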