Question

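I have a question about what Python can handle. I have a very large dataset (200 GB) that I will iterate over row by row in Python, storing the data in a dictionary and then performing some calculations. Finally, I will write the computed data to a CSV file. My concern is my computer's capacity: I am worried (or fairly sure) that my RAM cannot hold a dataset that large. Is there a better way? Here is the structure of the input data: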

#RIC    Date[L] Time[L] Type    ALP-L1-BidPrice ALP-L1-BidSize  ALP-L1-AskPrice ALP-L1-AskSize  ALP-L2-BidPrice ALP-L2-BidSize  ALP-L2-AskPrice ALP-L2-AskSize  ALP-L3-BidPrice ALP-L3-BidSize  ALP-L3-AskPrice ALP-L3-AskSize  ALP-L4-BidPrice ALP-L4-BidSize  ALP-L4-AskPrice ALP-L4-AskSize  ALP-L5-BidPrice ALP-L5-BidSize  ALP-L5-AskPrice ALP-L5-AskSize  TOR-L1-BidPrice TOR-L1-BidSize  TOR-L1-AskPrice TOR-L1-AskSize  TOR-L2-BidPrice TOR-L2-BidSize  TOR-L2-AskPrice TOR-L2-AskSize  TOR-L3-BidPrice TOR-L3-BidSize  TOR-L3-AskPrice TOR-L3-AskSize  TOR-L4-BidPrice TOR-L4-BidSize  TOR-L4-AskPrice TOR-L4-AskSize  TOR-L5-BidPrice TOR-L5-BidSize  TOR-L5-AskPrice TOR-L5-AskSize
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 16000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 46000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 22000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 40000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 44000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 46000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 38000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000

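Here is what I am trying to do:
1. Read in the data and store it in a dictionary with keys such as [symbol], [time], [bid] and [ask].
2. At any point in time, find the best bid price and the best ask price. This requires sorting horizontally across the keys, which I do not know how to do. Because the bid and ask prices come from different exchanges, we need to find the best prices and rank them from best to worst, along with the volume at each price.
3. Export to a CSV file.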

Here is my attempt at the code. Please help me make it more efficient:

# This file calculates the depth up to $50,000

import csv
from math import ceil
from collections import defaultdict

# open csv file
csv_file = open('2016_01_04-data_3_stocks.csv', 'r', newline='')  # 'rU' mode no longer exists in current Python 3
reader = csv.DictReader(csv_file)

# Set variables:
date = None
exchange_depth = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
effective_spread = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
time_bucket = [i * 100000.0 for i in range(0, 57600000000 // 100000)]  # integer division: range() needs an int

# Set functions
def time_to_milli(times):
    # Despite the name, this returns microseconds.
    # Expects 'HH:MM:SS.ffffff'; a missing hour field (as in 'MM:SS.f') is assumed to be 0.
    parts = [float(p) for p in times.split(':')]
    parts = [0.0] * (3 - len(parts)) + parts
    hours, minutes, seconds = parts
    # float(seconds) already carries the fractional part, so it is not added a second time
    return (hours * 3600 + minutes * 60 + seconds) * 1000000


# Extract data
for i in reader:
    if not date:
        date = i['Date[L]'][0:4] + "-" + i['Date[L]'][4:6] + "-" + i['Date[L]'][6:8]
    security = i['#RIC'].split('.')[0]
    exchange = i['#RIC'].split('.')[1]
    timestamp = float(time_to_milli(i['Time[L]']))
    bucket = ceil(timestamp / 100000.0) * 100000.0
    # add the size quoted at each price, for both exchanges, levels 1-5, bid and ask
    for venue in ('ALP', 'TOR'):
        for level in range(1, 6):
            for side in ('Bid', 'Ask'):
                price = i['%s-L%d-%sPrice' % (venue, level, side)]
                size = i['%s-L%d-%sSize' % (venue, level, side)]
                if price and size:  # some levels are blank in the input
                    exchange_depth[security][bucket][side][price] += float(size)
# Now rank bid price and ask price among exchange_depth[security][bucket]['Bid'] and exchange_depth[security][bucket]['Ask'] keys
    #I don't know how to do this

Answer 1

Based on what you have told us, you could do something like the following:

import csv

with open("path/to/my_dataset", 'r') as input_f, open("output.csv", 'a', newline='') as output_f:
    # create the writer once instead of once per row
    writer = csv.writer(output_f)
    # Keep reading lines from input data until you run out
    for line in input_f:
        # do processing and add to processed
        processed = []

        # write processed data to output file
        writer.writerow(processed)
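
To make the "do processing" step more concrete, here is a rough sketch of how the ranking in step 2 of the question might be done while still reading one row at a time. It rests on assumptions the question does not confirm: that the input is already sorted by security and time (so every row of a time bucket is consecutive), that the file is comma-delimited, and that a missing hour field in Time[L] means hour 0. The names to_micro, bucket_key, ranked_book and process, as well as the output column layout, are made up for illustration. As in the question's code, sizes at the same price are summed over all rows in a bucket.

```python
import csv
from collections import defaultdict
from itertools import groupby
from math import ceil

VENUES = ("ALP", "TOR")   # exchange prefixes taken from the question's header
LEVELS = range(1, 6)      # L1 .. L5
BUCKET = 100000.0         # bucket width in microseconds, as in the question


def to_micro(t):
    """'HH:MM:SS.ffffff' -> microseconds; a missing hour counts as 0 (assumption)."""
    parts = [float(p) for p in t.split(":")]
    parts = [0.0] * (3 - len(parts)) + parts
    return (parts[0] * 3600 + parts[1] * 60 + parts[2]) * 1e6


def bucket_key(row):
    """Group key: (security, upper edge of the time bucket the row falls in)."""
    security = row["#RIC"].split(".")[0]
    return security, ceil(to_micro(row["Time[L]"]) / BUCKET) * BUCKET


def ranked_book(rows):
    """Sum size per price over all rows, venues and levels; return bids and asks best-first."""
    depth = {"Bid": defaultdict(float), "Ask": defaultdict(float)}
    for row in rows:
        for venue in VENUES:
            for level in LEVELS:
                for side in depth:
                    price = row.get("%s-L%d-%sPrice" % (venue, level, side))
                    size = row.get("%s-L%d-%sSize" % (venue, level, side))
                    if price and size:  # skip blank or missing levels
                        depth[side][float(price)] += float(size)
    bids = sorted(depth["Bid"].items(), reverse=True)  # highest bid is best
    asks = sorted(depth["Ask"].items())                # lowest ask is best
    return bids, asks


def process(in_path, out_path, delimiter=","):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter=delimiter)
        writer = csv.writer(dst)
        writer.writerow(["security", "bucket", "side", "rank", "price", "size"])
        # groupby yields one run of consecutive same-key rows at a time,
        # so only the current bucket's price levels are held in memory
        for (security, bucket), rows in groupby(reader, key=bucket_key):
            bids, asks = ranked_book(rows)
            for side, book in (("Bid", bids), ("Ask", asks)):
                for rank, (price, size) in enumerate(book, start=1):
                    writer.writerow([security, bucket, side, rank, price, size])
```

You would call it as process("path/to/my_dataset", "output.csv"), passing delimiter="\t" if the file turns out to be tab-separated. If the input is not sorted, groupby will emit the same bucket more than once, so the file would need to be sorted first (or the buckets accumulated in a dictionary, which brings the memory problem back).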

Large dictionary in Python exceeds RAM capacity
