How do I combine multiple pandas DataFrames into one HDF5 object under a single key/group?

Time: 2016-10-07 20:11:57

Tags: pandas hdf5 dask pytables hdfstore

I am parsing data from a large csv file, 800 GB in size. For each line of data, I save it as a pandas DataFrame.

import csv
import pandas as pd

readcsvfile = csv.reader(csvfile)  # csvfile: an already-open file object
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv field:value pairs, "dictionary_line"
    # save it as a one-row pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])

Now I would like to save this in HDF5 format and query the h5 file as if it were the entire csv file.

import pandas as pd
store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]

My approach so far has been:

import csv
import pandas as pd
store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)  # csvfile: an already-open file object
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv field:value pairs, "dictionary_line"
    # save it as a one-row pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)

That is, I try to save each DataFrame df into the HDF5 store under a single key. However, this fails:

  Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'

So I could instead try to collect everything into one pandas DataFrame first, i.e.

import csv
import pandas as pd
store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)  # csvfile: an already-open file object
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse the line into a dictionary of csv field:value pairs, "dictionary_line"
    # save it as a one-row pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])   # grows into one big DataFrame holding the whole csv

and then store it in HDF5 format:

    store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)

However, I don't think I have the RAM/storage to hold all the csv lines in total_df before writing it out to HDF5.

So, how do I append each "single-line" df to the HDF5 file so that it ends up as one big DataFrame (like the original csv)?

EDIT: Here is a concrete example of a csv file with different data types:

 order    start    end    value    
 1        1342    1357    category1
 1        1459    1489    category7
 1        1572    1601    category23
 1        1587    1599    category2
 1        1591    1639    category1
 ....
 15        792     813    category13
 15        892     913    category5
 ....

1 Answer:

Answer 0 (score: 1)

Your code should work. Can you try the following code:

(code snippet not preserved)

If that code works, then the problem is with your data.
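Separately from the error, appending one row per `store.append` call is likely to be slow for a file this size, while concatenating everything first needs all rows in memory. A middle path is to buffer parsed rows and append them in batches. The following is only a minimal sketch of that idea, not the code from this answer: the helper names `append_in_batches` and `write_to_hdf`, the `batch_size` default, and the assumption that each parsed row is a flat dict of field:value pairs are all hypothetical.

```python
import pandas as pd


def append_in_batches(rows, write, batch_size=100_000):
    """Group parsed rows (dicts) into DataFrames and pass each one to `write`.

    Returns the total number of rows handed to `write`.
    """
    batch, start = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            # Continue the integer index across batches so row labels stay unique.
            write(pd.DataFrame(batch, index=range(start, start + len(batch))))
            start += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        write(pd.DataFrame(batch, index=range(start, start + len(batch))))
        start += len(batch)
    return start


def write_to_hdf(rows, h5_path, key, batch_size=100_000, min_itemsize=None):
    """Append all rows to one table under `key` (hypothetical helper; needs PyTables)."""
    with pd.HDFStore(h5_path) as store:
        return append_in_batches(
            rows,
            lambda df: store.append(
                key, df, data_columns=True, index=False, min_itemsize=min_itemsize
            ),
            batch_size,
        )
```

Assuming a generator `parsed_rows` that yields one `dictionary_line` per csv line, a call might look like `write_to_hdf(parsed_rows, "pathname/file.h5", "single_key")`. For string columns such as `value` in the sample above, the table's string width is fixed by the first append, so a later, longer string raises an error unless `min_itemsize` reserves enough room up front (for example `min_itemsize={"value": 32}`, where 32 is just an illustrative guess).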