Question

I have a huge HDF5 file and want to load part of it into a pandas DataFrame to perform some operations, but I am interested in filtering some rows.

An example explains it better:

The original HDF5 file looks like:

A B C D
1 0 34 11
2 0 32 15
3 1 35 22
4 1 34 15
5 1 31 9
1 0 34 15
2 1 29 11
3 0 34 15
4 1 12 14
5 0 34 15
1 0 32 13
2 1 34 15
etc etc etc etc

What I am trying to do is load this into a pandas DataFrame exactly as it is, but only the rows where A==1 or 3 or 4.

So far I can load the whole HDF5 file using:

store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])

I don't see how to include a where condition here.

Answer 1

The HDF5 file must be written in table format (as opposed to fixed format) in order to be queryable with the where argument of pd.read_hdf.

In addition, A must be declared as a data_column:

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=['A'],
          format='table')

or, to make all columns (queryable) data columns:

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=True,
          format='table')

Then you can use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select the rows where the value in column A is 1, 3 or 4. For example,

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=['A'],
          format='table')

print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13
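
Since the question opens the file through pd.HDFStore, the same kind of query can also be run with HDFStore.select. The following is only a sketch under a few assumptions: the file is written to a temporary directory rather than /tmp/out.h5, all columns are stored as data columns (data_columns=True), and the extra condition on B is added purely for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

# Assumption: a throwaway temporary file stands in for the real HDF5 file.
path = os.path.join(tempfile.mkdtemp(), 'out.h5')

# data_columns=True makes every column queryable, not just A.
df.to_hdf(path, key='results_table', mode='w', data_columns=True,
          format='table')

# Open read-only and filter on two data columns at once.
with pd.HDFStore(path, mode='r') as store:
    subset = store.select('results_table',
                          where='(A in [1,3,4]) & (B == 0)')

print(subset)
```

Only the matching rows are read into memory, which is the point when the file is too large to load in full.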

If you have a long list of values, vals, you can use string formatting to build the right where argument:

where='A in {}'.format(vals)
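
As a small illustration of the string formatting above, here is a sketch of a helper that builds the where string. The name build_where is hypothetical and not part of pandas. It converts the values to plain Python scalars first, because formatting a NumPy array directly produces text without commas (for example [1 3 4]), which is not a valid list literal in a query.

```python
import numpy as np


def build_where(column, values):
    """Return a query string such as 'A in [1, 3, 4]' (hypothetical helper)."""
    # tolist() turns NumPy scalars into plain Python numbers so that the
    # formatted text is an ordinary, comma-separated list literal.
    plain = np.asarray(values).tolist()
    return '{} in {}'.format(column, plain)


where_from_list = build_where('A', [1, 3, 4])
where_from_array = build_where('A', np.array([1, 3, 4]))
print(where_from_list)
print(where_from_array)
```

The resulting string can then be passed as the where argument of pd.read_hdf, as in the example above.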

Answer 2

You can do this with pandas.read_hdf and its optional where parameter.
For example: read_hdf('store_tl.h5', 'table', where = ['index>2'])
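
A runnable version of that call might look like the sketch below. The small DataFrame and the temporary file location are assumptions made for the example; the key 'table' and the file name follow the snippet above. The index of a table-format store can be queried without declaring any data columns.

```python
import os
import tempfile

import pandas as pd

# Assumption: a tiny frame with the default integer index 0..4.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [0, 0, 1, 1, 1]})

path = os.path.join(tempfile.mkdtemp(), 'store_tl.h5')

# The file still has to be written in table format for where to work.
df.to_hdf(path, key='table', mode='w', format='table')

# Keep only the rows whose index is greater than 2.
subset = pd.read_hdf(path, key='table', where=['index>2'])
print(subset)
```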

Read HDF5 file into pandas DataFrame with conditions
