我对Python的功能有疑问。 我有一个非常大的数据集(200 GB),我将使用python迭代行,将数据存储在字典中,然后执行一些计算。最后,我将计算数据写入CSV文件。 我担心的是我的电脑容量。我担心(或非常肯定)我的RAM无法存储那么大的数据集。有没有更好的办法? 这是输入数据的结构:
#RIC Date[L] Time[L] Type ALP-L1-BidPrice ALP-L1-BidSize ALP-L1-AskPrice ALP-L1-AskSize ALP-L2-BidPrice ALP-L2-BidSize ALP-L2-AskPrice ALP-L2-AskSize ALP-L3-BidPrice ALP-L3-BidSize ALP-L3-AskPrice ALP-L3-AskSize ALP-L4-BidPrice ALP-L4-BidSize ALP-L4-AskPrice ALP-L4-AskSize ALP-L5-BidPrice ALP-L5-BidSize ALP-L5-AskPrice ALP-L5-AskSize TOR-L1-BidPrice TOR-L1-BidSize TOR-L1-AskPrice TOR-L1-AskSize TOR-L2-BidPrice TOR-L2-BidSize TOR-L2-AskPrice TOR-L2-AskSize TOR-L3-BidPrice TOR-L3-BidSize TOR-L3-AskPrice TOR-L3-AskSize TOR-L4-BidPrice TOR-L4-BidSize TOR-L4-AskPrice TOR-L4-AskSize TOR-L5-BidPrice TOR-L5-BidSize TOR-L5-AskPrice TOR-L5-AskSize
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 16000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 46000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 22000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 40000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 44000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 46000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 38000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
这是我尝试做的事情: 1.读入ta数据并将其存储到字典中,其中包含[符号] [时间] [出价]和[询问]等键 2.在任何时间点,找到最佳买入价和最佳卖出价(这需要水平排序/在钥匙中我不知道如何进行排序)因为买入价和卖出价来自不同的交易所,我们需要找到最优惠的价格,并从最好到最差的排名,以及特定价格的数量。 3.导出到csv文件。
这是我对代码的尝试。请帮我写一下它的效率:
# this file calculate the depth up to $50,000
import csv
from math import ceil
from collections import defaultdict
# open csv file
csv_file = open('2016_01_04-data_3_stocks.csv', 'rU')
reader = csv.DictReader(csv_file)
# Set variables:
date = None
exchange_depth = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
effective_spread = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
time_bucket = [i * 100000.0 for i in range(0, 57600000000 / 100000)]
# Set functions
def time_to_milli(times):
hours = float(times.split(':')[0]) * 60 * 60 * 1000000
minutes = float(times.split(':')[1]) * 60 * 1000000
seconds = float(times.split(':')[2]) * 1000000
milliseconds = float(times.split('.')[1])
timestamp = hours + minutes + seconds + milliseconds
return timestamp
# Extract data
for i in reader:
if not bool(date):
date = i['Date[L]'][0:4] + "-" + i['Date[L]'][4:6] + "-" + i['Date[L]'][6:8]
security = i['#RIC'].split('.')[0]
exchange = i['#RIC'].split('.')[1]
timestamp = float(time_to_milli(i['Time[L]']))
bucket = ceil(float(time_to_milli(i['Time[L]'])) / 100000.0) * 100000.0
# input bid price and bid size
exchange_depth[security][bucket][Bid][i['ALP-L1-BidPrice']] += i['ALP-L1-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L2-BidPrice']] += i['ALP-L2-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L3-BidPrice']] += i['ALP-L3-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L4-BidPrice']] += i['ALP-L4-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L5-BidPrice']] += i['ALP-L5-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L1-BidPrice']] += i['TOR-L1-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L2-BidPrice']] += i['TOR-L2-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L3-BidPrice']] += i['TOR-L3-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L4-BidPrice']] += i['TOR-L4-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L5-BidPrice']] += i['TOR-L5-BidSize']
# input ask price and ask size
exchange_depth[security][bucket][Ask][i['ALP-L1-AskPrice']] += i['ALP-L1-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L2-AskPrice']] += i['ALP-L2-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L3-AskPrice']] += i['ALP-L3-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L4-AskPrice']] += i['ALP-L4-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L5-AskPrice']] += i['ALP-L5-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L1-AskPrice']] += i['TOR-L1-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L2-AskPrice']] += i['TOR-L2-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L3-AskPrice']] += i['TOR-L3-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L4-AskPrice']] += i['TOR-L4-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L5-AskPrice']] += i['TOR-L5-AskSize']
# Now rank bid price and ask price among exchange_depth[security][bucket][Bid] and exchange_depth[security][bucket][Ask] keys
#I don't know how to do this
答案 0 :(得分:0)
根据您告诉我们的内容,您可以执行以下操作:
import csv
with open("path/to/my_dataset", 'r') as input_f, open("output.csv", 'a') as output_f:
# Keep reading lines from input data until you run out
for line in f:
# do processing and add to processed
processed = []
# write processed data to output file
csv.writer(output_f).writerow(processed)