Question

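I'm looking for ways to speed up this process. I have it working, but it will take days to complete. I have one data file for each day of a year, and I want to combine them into a single HDF5 file with one node per data tag. The data looks like this: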

a,1468004920,986.078
a,1468004921,986.078
a,1468004922,987.078
a,1468004923,986.178
a,1468004924,984.078
b,1468004920,986.078
b,1468004924,986.078
b,1468004928,987.078
c,1468004924,98.608
c,1468004928,97.078
c,1468004932,98.078

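Note that each data tag has a different number of entries and a different update frequency. Each daily file has about 4 million rows and about 4000 distinct tags, and I have a year's worth of data.
The following code does what I want, but running it on every file will take days to complete. I'm looking for suggestions to speed it up: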

import pandas as pd
import datetime
import pytz
MI = pytz.timezone('US/Central')

def readFile(file_name):
    tmp_data=pd.read_csv(file_name,index_col=[1],names=['Tag','Timestamp','Value'])
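    # epoch seconds are UTC: localize first, then convert (index.tz cannot be assigned directly in current pandas)
    tmp_data.index=pd.to_datetime(tmp_data.index,unit='s').tz_localize('UTC').tz_convert(MI)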
    tmp_data['Tag']=tmp_data['Tag'].astype('category')
    tag_names=tmp_data.Tag.unique()
    for idx,name in enumerate(tag_names):
        tmp_data.loc[tmp_data.Tag==name].Value.to_hdf('test.h5',name,complevel=9, complib='blosc',format='table',append=True)

for name in ['test1.csv']:
    readFile(name)

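Basically, what I'm trying to do is "open up" the CSV data so that each tag is stored separately in the HDF5 file. I want all of the data tagged "a" to go into one leaf of the HDF5 file for the whole year, all of the "b" data into the next leaf, and so on. That means running the code above on each of the 365 files. I tried with and without compression, and I also tried index=False, but neither made much of a difference.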

Answer 1

I would do it this way:

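import pandas as pd
import pytz

MI = pytz.timezone('US/Central')

hdf_key = 'my_key'
csv_files = ['test1.txt']  # list every daily CSV file here

store = pd.HDFStore('/path/to/file.h5')

# loop which processes all your CSV files
for csv_file in csv_files:
    tmp_data=pd.read_csv(csv_file,index_col=[1],names=['Tag','Timestamp','Value'])
    # epoch seconds are UTC: localize first, then convert (index.tz cannot be assigned directly)
    tmp_data.index=pd.to_datetime(tmp_data.index,unit='s').tz_localize('UTC').tz_convert(MI)
    # pay attention to index=False - we want to index everything at the end;
    # data_columns is required so that Tag and Value can be indexed and queried later
    store.append(hdf_key, tmp_data, complevel=9, complib='blosc', index=False, data_columns=['Tag','Value'])

# all CSV files have been processed, let's index everything...
store.create_table_index(hdf_key, columns=['Tag','Value'], optlevel=9, kind='full')
store.close()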
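
With this layout every tag lives under a single key instead of in its own node, so the data for one tag is retrieved by filtering on the Tag column when reading. The following is only a minimal sketch of that read-back step, under a few assumptions: the scratch file demo.h5 and the four sample rows are made up for illustration (the values mirror the sample data in the question), and Tag is declared as a data column so that it can be used in a where expression.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the combined data (illustrative rows, not the real files).
sample = pd.DataFrame(
    {
        "Tag": ["a", "a", "b", "c"],
        "Value": [986.078, 987.078, 986.078, 98.608],
    },
    index=pd.to_datetime(
        [1468004920, 1468004922, 1468004920, 1468004924], unit="s"
    ),
)

# Hypothetical scratch file so the sketch does not touch any real data.
h5_path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with pd.HDFStore(h5_path) as store:
    # data_columns makes Tag usable in a `where` expression.
    store.append("my_key", sample, data_columns=["Tag"], index=False)
    # Build the index once, after all appends are done.
    store.create_table_index("my_key", columns=["Tag"], optlevel=9, kind="full")

    # Read back only the rows for tag "a"; the filter is applied by PyTables.
    tag_a = store.select("my_key", where="Tag == 'a'")

print(tag_a)
```

If a separate node per tag is still needed afterwards, the same select call could be run once per tag and each result written to its own key.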

Speeding up CSV to HDF5 conversion in Pandas
