Question

I have a large amount of data in an HDFStore (as a table), on the order of 80M rows with 1500 columns. Column A has integer values ranging between 1 and 40M or so. The values in column A are not unique and there may be between 1 and 30 rows with the same column A value. In addition, all rows which share a common value in column A will also have a common value in column B (not the same value as column A though).

I would like to do a select against the table to get a list of column A values and their corresponding column B values. The equivalent SQL statement would be something like SELECT DISTINCT ColA, ColB FROM someTable. What are some ways to achieve this? Can it be done such that the results of the query are stored directly into another table in the HDFStore?

Answer 1

Blocked algorithms

One solution is to look at dask.dataframe, which implements a subset of the Pandas API using blocked algorithms.

import dask.dataframe as dd
df = dd.read_hdf('myfile.hdf5', '/my/data', columns=['A', 'B'])
result = df.drop_duplicates().compute()

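In this particular case, dd.DataFrame.drop_duplicates pulls out a medium-sized block of rows, performs the pd.DataFrame.drop_duplicates call on it, and stores the (hopefully smaller) result. It does this for every block, concatenates the results, and then performs a final pd.DataFrame.drop_duplicates on the concatenated intermediate result. You could also do this with a plain for loop. Your case is a bit odd in that you also have a large number of unique elements, so this may still be a challenging computation even with blocked algorithms. It's worth a shot.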
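The for-loop variant mentioned above could look roughly like the following. This is a minimal sketch, not a tuned implementation: distinct_rows and distinct_to_store are hypothetical helper names, the column names A and B come from the question, and the chunk size is an arbitrary assumption. The second helper also writes the result back into the same store, which is one way to address the last part of the question.

```python
import pandas as pd


def distinct_rows(chunks, columns=("A", "B")):
    """Return the distinct rows of `columns` across an iterable of DataFrame chunks."""
    cols = list(columns)
    partials = []
    for chunk in chunks:
        # Deduplicate each chunk on its own so only the reduced result is kept.
        partials.append(chunk[cols].drop_duplicates())
    if not partials:
        return pd.DataFrame(columns=cols)
    # Duplicates can still span chunk boundaries, so deduplicate once more.
    combined = pd.concat(partials, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


def distinct_to_store(path, src_key, dst_key, chunksize=1_000_000):
    """Hypothetical helper: read `src_key` in chunks, write distinct pairs to `dst_key`."""
    with pd.HDFStore(path) as store:
        # Only columns A and B are materialised for each chunk.
        chunks = store.select(src_key, columns=["A", "B"], chunksize=chunksize)
        result = distinct_rows(chunks)
        # format="table" keeps the output queryable with store.select(..., where=...).
        store.put(dst_key, result, format="table", data_columns=True)
    return result
```

Note that with roughly 40M distinct values in column A, the intermediate results stay large, so the final deduplication still has to fit in memory.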

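Column store

Alternatively, you should consider a storage format that stores your data as individual columns. That way you can collect just the two columns you need, A and B, rather than having to wade through all of your data on disk. Arguably, you should be able to fit 80 million rows of two columns into a single Pandas dataframe in memory. You could consider bcolz for this.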
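As a rough check on the in-memory claim, here is a back-of-the-envelope estimate. It assumes both columns are stored as 8-byte integers (int64), which the question does not state, and it ignores index and temporary-copy overhead.

```python
# Back-of-the-envelope size of columns A and B held in memory.
# Assumption: both columns are stored as 8-byte integers (int64).
ROWS = 80_000_000
COLUMNS = 2
BYTES_PER_VALUE = 8

total_bytes = ROWS * COLUMNS * BYTES_PER_VALUE
total_gb = total_bytes / 1e9
print(f"{total_gb:.2f} GB")  # prints "1.28 GB"
```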

Answer 2

To be clear, did you try something like this and find that it didn't work?

import pandas
import tables
import pandasql

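Check that your store is the type you think it is:

In: store

Out: <class 'pandas.io.pytables.HDFStore'>

You can select a table from the store like this:

tablename = store.select('tablename')

Check that it worked:

In: type(tablename)

Out: pandas.core.frame.DataFrame

Then you can do something like this (region and segment are example column names; substitute your own, such as A and B):

q = """SELECT DISTINCT region, segment FROM tablename"""

distinct_df = pandasql.sqldf(q, locals())

The variable name matters here, because pandasql resolves the table names in the query against the variables passed in via locals().

(Note that you will get deprecation warnings doing it this way, but it works.)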
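If pandasql is not available, the same SELECT DISTINCT can be expressed with pandas alone once the table is in memory. The sketch below uses a small made-up DataFrame as a stand-in for the one loaded from the store; the sales column and all values are illustrative only.

```python
import pandas as pd

# Small stand-in for the DataFrame loaded from the store (made-up values).
tablename = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "segment": ["a", "a", "b", "c"],
    "sales": [1, 2, 3, 4],
})

# Equivalent of: SELECT DISTINCT region, segment FROM tablename
distinct_df = (
    tablename[["region", "segment"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
```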
