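I have a large dataset (10 CSV files of 12 GB each) with 25 columns, and I want to convert it into a dataset with 6 columns. The first 3 columns stay the same, the 4th column is the variable name, and the remaining columns contain the data. Here is my input: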
#RIC Date[L] Time[L] Type L1-BidPrice L1-BidSize L1-AskPrice L1-AskSize L2-BidPrice L2-BidSize L2-AskPrice L2-AskSize L3-BidPrice L3-BidSize L3-AskPrice L3-AskSize L4-BidPrice L4-BidSize L4-AskPrice L4-AskSize L5-BidPrice L5-BidSize L5-AskPrice L5-AskSize
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:10.8 Market Depth 5.29 50000 5.3 44000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 32000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
HOU.ALP 20150901 30:12.1 Market Depth 5.29 50000 5.3 38000 5.28 50000 5.31 50000 5.27 50000 5.32 50000 5.26 50000 5.33 50000 5.34 50000
I would like to convert it to:
#RIC Date[L] Time[L] level Bid_price bid_volume Ask_price Ask_volume
HOU.ALP 20150901 30:10.8 L1 5.29 50000 5.3 50000
HOU.ALP 20150901 30:10.8 L2 5.28 50000 5.31 50000
HOU.ALP 20150901 30:12.1 L3 5.27 50000 5.32 50000
HOU.ALP 20150901 30:12.1 L4 5.26 50000 5.33 50000
HOU.ALP 20150901 30:12.1 L5
HOU.ALP 20150901 30:12.1 L1 5.29 50000 5.3 50000
HOU.ALP 20150901 30:12.1 L2 5.28 44000 5.31 50000
HOU.ALP 20150901 30:12.1 L3 5.27 48000 5.32 50000
HOU.ALP 20150901 30:12.1 L4 5.26 50000 5.33 50000
Here is my attempt at the code. I think I have to use a dictionary to write to the CSV file:
import csv
from itertools import izip_longest  # Python 2; zip_longest on Python 3

def depth_data_transformation(input_file_list, output_file):
    for file in input_file_list:
        file_to_open = '%s.csv' % file
        with open(file_to_open) as f, open(output_file, "w") as out:
            next(f)  # skip header
            cols = ["#RIC", "Date[L]", "Time[L]", "level", "Bid_price", "bid_volume", "Ask_price", "Ask_volume"]
            wr = csv.writer(out)
            wr.writerow(cols)
            for row in csv.reader(f):
                # skip the first three cols and the Type col
                it = row[4:]
                # izip_longest(*[iter(it)] * 4, fillvalue="") -> group into 4's, add empty string for missing values
                for ind, t in enumerate(izip_longest(*[iter(it)] * 4, fillvalue=""), 1):
                    # first 3 cols, level and group all in one row/list.
                    wr.writerow(row[:3] + ["l{}".format(ind)] + list(t))
Answer 0 (score: 1)
You need to group the levels, i.e. L1-BidPrice L1-BidSize L1-AskPrice L1-AskSize, and write each level to a new row:
import csv
from itertools import zip_longest  # izip_longest on Python 2

# newline="" lets the csv module handle line endings itself (Python 3)
with open("infile.csv", newline="") as f, open("out.csv", "w", newline="") as out:
    next(f)  # skip header
    cols = ["#RIC", "Date[L]", "Time[L]", "level", "Bid_price", "bid_volume", "Ask_price", "Ask_volume"]
    wr = csv.writer(out)
    wr.writerow(cols)
    for row in csv.reader(f):
        # skip the first three cols and the Type col.
        it = row[4:]
        # zip_longest(*[iter(it)] * 4, fillvalue="") -> group into 4's, add empty string for missing values
        for ind, t in enumerate(zip_longest(*[iter(it)] * 4, fillvalue=""), 1):
            # first 3 cols, level and group all in one row/list.
            wr.writerow(row[:3] + ["l{}".format(ind)] + list(t))
Which will give you:
#RIC,Date[L],Time[L],level,Bid_price,bid_volume,Ask_price,Ask_volume
HOU.ALP,20150901,30:10.8,l1,5.29,50000,5.3,32000
HOU.ALP,20150901,30:10.8,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:10.8,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:10.8,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:10.8,l5,5.34,50000,,
HOU.ALP,20150901,30:10.8,l1,5.29,50000,5.3,44000
HOU.ALP,20150901,30:10.8,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:10.8,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:10.8,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:10.8,l5,5.34,50000,,
HOU.ALP,20150901,30:12.1,l1,5.29,50000,5.3,32000
HOU.ALP,20150901,30:12.1,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:12.1,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:12.1,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:12.1,l5,5.34,50000,,
HOU.ALP,20150901,30:12.1,l1,5.29,50000,5.3,38000
HOU.ALP,20150901,30:12.1,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:12.1,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:12.1,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:12.1,l5,5.34,50000,,
In for ind, t in enumerate(zip_longest(*[iter(it)] * 4, fillvalue=""), 1), the enumerate with a start index of 1 keeps track of which group/level we are on.
zip_longest(*[iter(it)] * 4, fillvalue="") splits the columns into groups of four: L1-BidPrice, L1-BidSize, L1-AskPrice, L1-AskSize, then L2-BidPrice, L2-BidSize, L2-AskPrice, L2-AskSize, and so on, all the way up to Ln-...
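To see the grouping idiom on its own, here is a minimal sketch on a shortened row. The values are taken from the first input row above; the names values and groups are only for illustration.

```python
from itertools import zip_longest

# Values after the Type column, shortened to two full levels plus a partial one.
values = ["5.29", "50000", "5.3", "32000",
          "5.28", "50000", "5.31", "50000",
          "5.34", "50000"]

# [iter(values)] * 4 is a list holding the SAME iterator four times, so
# zip_longest takes four consecutive items for each tuple it builds.
# fillvalue="" pads the last tuple when the row runs out of values.
groups = list(zip_longest(*[iter(values)] * 4, fillvalue=""))

for level, group in enumerate(groups, 1):
    print("l{}".format(level), group)
```

The last group is padded with empty strings, which is why the l5 rows in the output above end in two empty fields.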
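Your expected output contains HOU.ALP 20150901 30:10.8 L1 5.29 50000 5.3 50000, but the value of L1-AskSize in the input is 32000. Each row also has 5 levels and you have 8 columns, so I think your expected output is wrong.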
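The attempt in the question opens the output file in "w" mode inside the loop over input files, so each input file would overwrite the output of the previous one. Below is one possible sketch of the same approach wrapped in the question's depth_data_transformation function, assuming all input files share the same layout and should end up in a single output file. The group_size parameter and the COLS constant are additions for illustration, not part of the original code.

```python
import csv
from itertools import zip_longest

COLS = ["#RIC", "Date[L]", "Time[L]", "level",
        "Bid_price", "bid_volume", "Ask_price", "Ask_volume"]

def depth_data_transformation(input_file_list, output_file, group_size=4):
    """Write one output row per level for every row of every input file."""
    # Open the output once, so later input files do not overwrite earlier ones.
    with open(output_file, "w", newline="") as out:
        wr = csv.writer(out)
        wr.writerow(COLS)  # header written a single time
        for name in input_file_list:
            # As in the question, ".csv" is appended to each name.
            with open("%s.csv" % name, newline="") as f:
                reader = csv.reader(f)
                next(reader, None)  # skip this file's header, if any
                # Rows are streamed one at a time, so memory use stays small.
                for row in reader:
                    it = row[4:]  # skip the first three cols and the Type col
                    grouped = zip_longest(*[iter(it)] * group_size, fillvalue="")
                    for ind, t in enumerate(grouped, 1):
                        wr.writerow(row[:3] + ["l{}".format(ind)] + list(t))
```

It would be called with the file names without their extension, e.g. depth_data_transformation(["file1", "file2"], "out.csv"), where the file names are placeholders.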