Question

I read a random subset of rows from an HDF table with the following Python / Pandas code:

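import numpy as np
import pandas as pd

hdf_store = pd.HDFStore('path_to_data.h5')
total_rows = hdf_store.get_storer('hdf_table_name').nrows

num_rows = int(total_rows * .25)
# upper bound is total_rows; note that randint may return duplicate positions
row_indices = np.random.randint(0, total_rows, size=num_rows)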

my_df = pd.read_hdf(hdf_store, 'hdf_table_name', where=pd.Index(row_indices))

Later in the program, I want to pull the remaining rows from the HDF5 table. But both of the following attempts raise an error:

rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
   where=pd.Index(not in (row_indices)))

rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
   where=not pd.Index(row_indices))

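Is there a way to pull HDF rows by selecting the records that are not in a list of indices?

Because the table is larger than my RAM, I want to avoid pulling all rows from the HDF file (even in chunks) and then splitting them, since that would mean holding both tables at once. I could map the index to another column and select the rows whose mapped value is not in a given set, but that would probably be much slower than querying the index directly.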

Answer 1

You can use the Index.difference method.

Demo:

# randomly select 25% of index elements (without duplicates `replace=False`)
sample_idx = np.random.choice(np.arange(total_rows), total_rows//4, replace=False)

# select remaining index elements
rest_idx = pd.Index(np.arange(total_rows)).difference(sample_idx)

# get rest rows by index
rest = hdf_store.select('hdf_table_name', where=rest_idx)
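
As a quick sanity check of the idea above, here is a minimal sketch that runs without any HDF file. The table size of 20 and the seeded generator are assumptions made purely for illustration; the point is only that the sampled positions and the result of Index.difference are disjoint and together cover every row position.

```python
import numpy as np
import pandas as pd

total_rows = 20  # hypothetical table size, for illustration only

# sample 25% of the row positions without duplicates
rng = np.random.default_rng(0)
sample_idx = rng.choice(total_rows, size=total_rows // 4, replace=False)

# every position that was not sampled
rest_idx = pd.Index(np.arange(total_rows)).difference(sample_idx)

# the two sets are disjoint and together cover the whole table
assert rest_idx.intersection(sample_idx).empty
assert len(sample_idx) + len(rest_idx) == total_rows
```

Note that the values passed to `where=` in this way are treated as row positions (coordinates), so this works when the positions run from 0 to total_rows - 1.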

PS: you can optionally select the remaining rows in chunks...
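
One possible way to do the chunked selection is sketched below. The helper name `iter_rest_chunks` and the default `chunk_rows` value are hypothetical, not part of pandas; the sketch assumes only that the store object has a `select(key, where=...)` method that accepts a slice of row positions, as `HDFStore.select` is used in the demo above.

```python
def iter_rest_chunks(store, key, rest_idx, chunk_rows=100_000):
    """Yield the rows addressed by rest_idx, at most chunk_rows at a time.

    store      -- object with a select(key, where=...) method (e.g. an HDFStore)
    key        -- name of the table inside the store
    rest_idx   -- row positions to read (e.g. a pandas Index)
    chunk_rows -- hypothetical chunk size; tune it to the available RAM
    """
    for start in range(0, len(rest_idx), chunk_rows):
        # slice the positions and read only that slice from the store
        yield store.select(key, where=rest_idx[start:start + chunk_rows])


# Usage sketch (process is a hypothetical per-chunk function):
# for chunk in iter_rest_chunks(hdf_store, 'hdf_table_name', rest_idx):
#     process(chunk)
```

Because each chunk is read and released before the next one is requested, only one chunk needs to fit in memory at a time.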

Python: read HDF5 rows where the index is not in a list
