Big data transformation in Python

Time: 2016-10-18 14:20:46

Tags: python csv

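I have a large dataset (10 CSV files of 12 GB each) with 25 columns, and I want to transform it into a dataset with 6 columns. The first 3 columns stay the same, the 4th column holds the variable name, and the remaining columns contain the data. Here is my input: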

#RIC    Date[L] Time[L] Type    L1-BidPrice L1-BidSize  L1-AskPrice L1-AskSize  L2-BidPrice L2-BidSize  L2-AskPrice L2-AskSize  L3-BidPrice L3-BidSize  L3-AskPrice L3-AskSize  L4-BidPrice L4-BidSize  L4-AskPrice L4-AskSize  L5-BidPrice L5-BidSize  L5-AskPrice L5-AskSize
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 44000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 38000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000

I would like to convert it to:

#RIC    Date[L] Time[L] level   Bid_price   bid_volume  Ask_price   Ask_volume
HOU.ALP 20150901    30:10.8 L1  5.29    50000   5.3 50000
HOU.ALP 20150901    30:10.8 L2  5.28    50000   5.31    50000
HOU.ALP 20150901    30:12.1 L3  5.27    50000   5.32    50000
HOU.ALP 20150901    30:12.1 L4  5.26    50000   5.33    50000
HOU.ALP 20150901    30:12.1 L5              
HOU.ALP 20150901    30:12.1 L1  5.29    50000   5.3 50000
HOU.ALP 20150901    30:12.1 L2  5.28    44000   5.31    50000
HOU.ALP 20150901    30:12.1 L3  5.27    48000   5.32    50000
HOU.ALP 20150901    30:12.1 L4  5.26    50000   5.33    50000

Here is my attempt at the code. I think I have to use a dictionary to write to the CSV file.

import csv
from itertools import izip_longest  # Python 2; use zip_longest on Python 3

def depth_data_transformation(input_file_list, output_file):
    for file in input_file_list:
        file_to_open = '%s.csv' % file
        with open(file_to_open) as f, open(output_file, "w") as out:
            next(f)  # skip header
            cols = ["#RIC", "Date[L]", "Time[L]", "level", "Bid_price", "bid_volume", "Ask_price", "Ask_volume"]
            wr = csv.writer(out)
            wr.writerow(cols)
            for row in csv.reader(f):
                # skip the first four cols (RIC, date, time, type); keep only the data cols
                it = row[4:]
                # izip_longest(*[iter(it)] * 4, fillvalue="") -> group into 4's, add empty string for missing values
                for ind, t in enumerate(izip_longest(*[iter(it)] * 4, fillvalue=""), 1):
                    # first 3 cols, level and group all in one row/list.
                    wr.writerow(row[:3] + ["l{}".format(ind)] + list(t))

1 answer:

Answer 0 (score: 1)

You need to group the columns by level, i.e. L1-BidPrice L1-BidSize L1-AskPrice L1-AskSize, and write each level to a new row:

import csv
from itertools import zip_longest  # izip_longest on Python 2


# newline="" is what the csv module expects for file objects on Python 3
with open("infile.csv", newline="") as f, open("out.csv", "w", newline="") as out:
    next(f)  # skip header
    cols = ["#RIC", "Date[L]", "Time[L]", "level", "Bid_price", "bid_volume", "Ask_price", "Ask_volume"]
    wr = csv.writer(out)
    wr.writerow(cols)
    for row in csv.reader(f):
        # skip the first four cols (RIC, date, time, type); keep only the data cols
        it = row[4:]
        # zip_longest(*[iter(it)] * 4, fillvalue="") -> group into 4's, add empty string for missing values
        for ind, t in enumerate(zip_longest(*[iter(it)] * 4, fillvalue=""), 1):
            # first 3 cols, level and group all in one row/list.
            wr.writerow(row[:3] + ["l{}".format(ind)] + list(t))

Which will give you:

#RIC,Date[L],Time[L],level,Bid_price,bid_volume,Ask_price,Ask_volume
HOU.ALP,20150901,30:10.8,l1,5.29,50000,5.3,32000
HOU.ALP,20150901,30:10.8,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:10.8,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:10.8,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:10.8,l5,5.34,50000,,
HOU.ALP,20150901,30:10.8,l1,5.29,50000,5.3,44000
HOU.ALP,20150901,30:10.8,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:10.8,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:10.8,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:10.8,l5,5.34,50000,,
HOU.ALP,20150901,30:12.1,l1,5.29,50000,5.3,32000
HOU.ALP,20150901,30:12.1,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:12.1,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:12.1,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:12.1,l5,5.34,50000,,
HOU.ALP,20150901,30:12.1,l1,5.29,50000,5.3,38000
HOU.ALP,20150901,30:12.1,l2,5.28,50000,5.31,50000
HOU.ALP,20150901,30:12.1,l3,5.27,50000,5.32,50000
HOU.ALP,20150901,30:12.1,l4,5.26,50000,5.33,50000
HOU.ALP,20150901,30:12.1,l5,5.34,50000,,

In for ind, t in enumerate(zip_longest(*[iter(it)] * 4, fillvalue=""), 1), the enumerate with a start index of 1 keeps track of which group/level we are at.

zip_longest(*[iter(it)] * 4, fillvalue="") chunks the data columns into groups of four: L1-BidPrice, L1-BidSize, L1-AskPrice, L1-AskSize, then L2-BidPrice, L2-BidSize, L2-AskPrice, L2-AskSize, and so on up to Ln-...
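To see the grouping idiom in isolation, here is a minimal sketch on a made-up row. The values in fields and the helper name group_levels are illustrative assumptions, not taken from the real files. Because the same iterator object is repeated four times, each tuple that zip_longest builds pulls four consecutive items from it, and a short final group is padded with the fill value.

```python
from itertools import zip_longest  # izip_longest on Python 2

# Hypothetical data columns for one row: two full levels and one partial level.
fields = ["5.29", "50000", "5.3", "32000",
          "5.28", "50000", "5.31", "50000",
          "5.34", "50000"]

def group_levels(values, size=4):
    """Return an iterator of (level_number, tuple of `size` values).

    The last tuple is padded with "" if the values run out early.
    """
    shared = iter(values)  # one iterator, referenced `size` times below
    return enumerate(zip_longest(*[shared] * size, fillvalue=""), 1)

levels = list(group_levels(fields))
for level, group in levels:
    print(level, group)
```

The third group here comes out with two empty strings at the end, which is the same padding that produces the trailing commas on the l5 rows in the output above.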

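Your expected output contains HOU.ALP 20150901 30:10.8 L1 5.29 50000 5.3 50000, but in the input the value of L1-AskSize for the first row is 32000. There are also 5 levels per row and you have 8 columns, so I presume your expected output is wrong.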
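The function in the question opens the output file in "w" mode inside the loop over input files, so each input file would overwrite the result of the previous one. One possible way around that, sketched below, is to open the output once and stream every input file into it row by row, so the 12 GB files never have to fit in memory. The function name transform_depth_files, the COLS constant, and the group_size parameter are assumptions made for illustration. The sketch also assumes comma-delimited input, as the answer above does.

```python
import csv
from itertools import zip_longest  # izip_longest on Python 2

COLS = ["#RIC", "Date[L]", "Time[L]", "level",
        "Bid_price", "bid_volume", "Ask_price", "Ask_volume"]

def transform_depth_files(input_paths, output_path, group_size=4):
    """Write one output row per (input row, level) for all input files."""
    # Open the output once so later files append instead of overwriting.
    with open(output_path, "w", newline="") as out:
        wr = csv.writer(out)
        wr.writerow(COLS)  # header is written a single time
        for path in input_paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                next(reader, None)  # skip this file's header, if any
                for row in reader:
                    data = iter(row[4:])  # skip RIC, date, time, type
                    groups = zip_longest(*[data] * group_size, fillvalue="")
                    for ind, t in enumerate(groups, 1):
                        wr.writerow(row[:3] + ["l{}".format(ind)] + list(t))
```

It could be called as transform_depth_files(["file1.csv", "file2.csv"], "out.csv"), where the file names are placeholders.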