我对Python的功能有疑问。 我有一个非常大的数据集(200 GB),我将使用python迭代行,将数据存储在字典中,然后执行一些计算。最后,我将计算数据写入CSV文件。 我担心的是我的电脑容量。我担心(或非常肯定)我的RAM无法存储那么大的数据集。有没有更好的办法? 这是输入数据的结构:
#RIC Date[L] Time[L] Type ALP-L1-BidPrice ALP-L1-BidSize ALP-L1-AskPrice ALP-L1-AskSize ALP-L2-BidPrice ALP-L2-BidSize ALP-L2-AskPrice ALP-L2-AskSize ALP-L3-BidPrice ALP-L3-BidSize ALP-L3-AskPrice ALP-L3-AskSize ALP-L4-BidPrice ALP-L4-BidSize ALP-L4-AskPrice ALP-L4-AskSize ALP-L5-BidPrice ALP-L5-BidSize ALP-L5-AskPrice ALP-L5-AskSize TOR-L1-BidPrice TOR-L1-BidSize TOR-L1-AskPrice TOR-L1-AskSize TOR-L2-BidPrice TOR-L2-BidSize TOR-L2-AskPrice TOR-L2-AskSize TOR-L3-BidPrice TOR-L3-BidSize TOR-L3-AskPrice TOR-L3-AskSize TOR-L4-BidPrice TOR-L4-BidSize TOR-L4-AskPrice TOR-L4-AskSize TOR-L5-BidPrice TOR-L5-BidSize TOR-L5-AskPrice TOR-L5-AskSize
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 16000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 46000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 22000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 40000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 44000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 46000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 38000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000 5.29 50000 5.3 36000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
这是我尝试做的事情:
这是我对代码的尝试。请帮我写一下它的效率:
# this file calculate the depth up to $50,000
import csv
from math import ceil
from collections import defaultdict
# open csv file
csv_file = open('2016_01_04-data_3_stocks.csv', 'rU')
reader = csv.DictReader(csv_file)
# Set variables:
date = None
exchange_depth = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
effective_spread = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
time_bucket = [i * 100000.0 for i in range(0, 57600000000 / 100000)]
# Set functions
def time_to_milli(times):
hours = float(times.split(':')[0]) * 60 * 60 * 1000000
minutes = float(times.split(':')[1]) * 60 * 1000000
seconds = float(times.split(':')[2]) * 1000000
milliseconds = float(times.split('.')[1])
timestamp = hours + minutes + seconds + milliseconds
return timestamp
# Extract data
for i in reader:
if not bool(date):
date = i['Date[L]'][0:4] + "-" + i['Date[L]'][4:6] + "-" + i['Date[L]'][6:8]
security = i['#RIC'].split('.')[0]
exchange = i['#RIC'].split('.')[1]
timestamp = float(time_to_milli(i['Time[L]']))
bucket = ceil(float(time_to_milli(i['Time[L]'])) / 100000.0) * 100000.0
# input bid price and bid size
exchange_depth[security][bucket][Bid][i['ALP-L1-BidPrice']] += i['ALP-L1-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L2-BidPrice']] += i['ALP-L2-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L3-BidPrice']] += i['ALP-L3-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L4-BidPrice']] += i['ALP-L4-BidSize']
exchange_depth[security][bucket][Bid][i['ALP-L5-BidPrice']] += i['ALP-L5-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L1-BidPrice']] += i['TOR-L1-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L2-BidPrice']] += i['TOR-L2-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L3-BidPrice']] += i['TOR-L3-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L4-BidPrice']] += i['TOR-L4-BidSize']
exchange_depth[security][bucket][Bid][i['TOR-L5-BidPrice']] += i['TOR-L5-BidSize']
# input ask price and ask size
exchange_depth[security][bucket][Ask][i['ALP-L1-AskPrice']] += i['ALP-L1-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L2-AskPrice']] += i['ALP-L2-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L3-AskPrice']] += i['ALP-L3-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L4-AskPrice']] += i['ALP-L4-AskSize']
exchange_depth[security][bucket][Ask][i['ALP-L5-AskPrice']] += i['ALP-L5-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L1-AskPrice']] += i['TOR-L1-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L2-AskPrice']] += i['TOR-L2-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L3-AskPrice']] += i['TOR-L3-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L4-AskPrice']] += i['TOR-L4-AskSize']
exchange_depth[security][bucket][Ask][i['TOR-L5-AskPrice']] += i['TOR-L5-AskSize']
# Now rank bid price and ask price among exchange_depth[security][bucket][Bid] and exchange_depth[security][bucket][Ask] keys
#I don't know how to do this
答案 0 :(得分:1)
这很难回答,因为当您进入非常大的数据集之类的东西时,如何存储事物非常重要。同样重要的是如何定义数据集大小。
如果这是一个消耗200GB的CSV文件,很可能你可以将它以二进制形式存储在你的计算机上,它将占用8GB的容量。然后,如果你为每个数字数据使用一个python对象,它可能接近1TB的实际ram。
如果你有一个可能适合ram的数据集,并希望使其适合ram,那么首先使用像numpy这样的东西,它提供了一个基于本机C结构的接口。您可以使用struct.pack之类的东西来帮助将数据强制转换为C类型。
如果您的数据集无法适应ram,您需要查看分析该数据的其他方法。这些其他方式是数据库和/或不同语言,如" R"。你也可以出去购买一台有大量公羊的服务器。