Question

I have a DataFrame with the following format:

import pandas as pd

# Scores are numeric so that nsmallest and sorting behave as expected
d = {'id1': ['a', 'a', 'b', 'b'], 'id2': ['a', 'b', 'b', 'c'], 'score': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)


print(df)
     id1    id2    score
0     a      a       1
1     a      b       2
2     b      b       3
3     b      c       4

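The DataFrame has over 1 billion rows. It holds pairwise distance scores between the objects in the id1 and id2 columns. I don't need every pair combination: for each object in id1 (there are about 40k unique IDs) I only want to keep the 100 closest (smallest) distance scores.

The code I'm running is: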

df = df.groupby(['id1'])['score'].nsmallest(100)

The problem with this code is that every time I try to run it I get a memory error:

MemoryError: Unable to allocate 8.53 GiB for an array with shape (1144468900,) and data type float64

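I assume this is because, behind the scenes, pandas is creating a new DataFrame for the result of the group by while the existing DataFrame is still held in memory.

The reason I'm only taking the top 100 per id is to reduce the size of the DataFrame, but the process itself seems to take up more space while it runs.

Is there a way to filter this data down without using more memory?

The desired output would look like this (assuming the top 1 instead of the top 100):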

     id1    id2    score
0     a      a       1        
1     b      b       3

Some additional information about the original df:

df.count()
permid_1    1144468900
permid_2    1144468900
distance    1144468900
dtype: int64

df.dtypes
permid_1      int64
permid_2      int64
distance    float64
dtype: object

df.shape
(1144468900, 3)

id1 & id2 unique value counts: 33,830

Answer 1

I can't test this code without your data, but perhaps try something like this:

import numpy as np

indicies = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindicies = np.argsort(scores.values)[:100]  # numpy is raw index only
    min_indicies = scores.iloc[min_subindicies].index  # convert to pandas indicies
    indicies.extend(min_indicies)

# Keep only the collected rows
df = df.loc[indicies]

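In words: for each unique ID (the_id), extract the matching scores. Then find the raw (positional) indices of the 100 smallest. Select those positions and map them from raw positions to pandas index labels. Save the pandas index labels to your list. Finally, at the very end, subset the DataFrame on those labels.

iloc does accept list input. some_series.iloc should align correctly with some_series.values, which is what should allow this to work. Storing the indices indirectly like this should be noticeably more memory-efficient.

df['score'][df['id1'] == the_id] should work more efficiently than df.loc[df['id1'] == the_id, 'score']. Instead of taking the whole DataFrame and masking it, it takes only the score column of the DataFrame and masks that to match the ID. If you want to free memory more promptly, you may want to del scores at the end of each loop iteration.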
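As a possible variation on the loop above, the sketch below swaps np.argsort for np.argpartition, which finds the n smallest values without fully sorting each group. This is a different technique from the one in the answer, shown on the toy data from the question; the helper name top_n_labels and its parameters are made up for illustration.

```python
import numpy as np
import pandas as pd

def top_n_labels(df, n, group_col="id1", score_col="score"):
    """Return the index labels of the n smallest scores within each group."""
    keep = []
    for the_id in df[group_col].unique():
        scores = df[score_col][df[group_col] == the_id]
        values = scores.to_numpy()
        if len(values) > n:
            # argpartition places the n smallest values in the first n slots
            # (in arbitrary order) without sorting the whole group.
            positions = np.argpartition(values, n - 1)[:n]
        else:
            # Small group: keep every row.
            positions = np.arange(len(values))
        # Map raw positions back to pandas index labels.
        keep.extend(scores.index[positions])
    return keep

df = pd.DataFrame({
    "id1": ["a", "a", "b", "b"],
    "id2": ["a", "b", "b", "c"],
    "score": [1.0, 2.0, 3.0, 4.0],
})
labels = top_n_labels(df, n=1)  # n=1 stands in for 100
result = df.loc[labels]
print(result)
```

The rows within each group come back in arbitrary order, so sort the result afterwards if order matters.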

Answer 2

You could try the following:

df.sort_values(["id1", "score"], inplace=True)
df["dummy_key"] = df["id1"].shift(100).ne(df["id1"])

df = df.loc[df["dummy_key"]]

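1. Sort in ascending order (smallest on top), first by group and then by score.
2. Add a column indicating whether the current id1 differs from the id1 100 rows earlier (if it does not, the row is the 101st or later within its group).
3. Filter by the column from step 2.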
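To see why the shift comparison selects the first rows of each group, here is a small sketch on the toy data from the question, with n = 1 standing in for 100. It uses a temporary boolean Series (in_top_n, a name chosen here) rather than adding a helper column, and is only an illustration of the idea.

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["a", "a", "b", "b"],
    "id2": ["a", "b", "b", "c"],
    "score": [1, 2, 3, 4],
})
n = 1  # stand-in for 100

# Sort by group, then by score, so each group's smallest scores come first.
df = df.sort_values(["id1", "score"])

# True when the row n positions earlier belongs to a different id1 (or does
# not exist), i.e. the current row is among the first n rows of its group.
in_top_n = df["id1"].shift(n).ne(df["id1"])

top = df.loc[in_top_n]
print(top)
```

On this data the mask keeps the rows with scores 1 and 3, matching the desired output in the question.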

Answer 3

As Aryerez outlined in a comment, you can do something like this:

closest = pd.concat([df.loc[df['id1'] == id1].sort_values(by = 'score').head(100)
                     for id1 in set(df['id1'])])

You could also do:

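def get_hundredth(id1):
    # Score of the 100th-smallest row for this id1 (position 99), or the
    # largest score when the group has fewer than 100 rows.
    sub_df = df.loc[df['id1'] == id1].sort_values(by = 'score')
    return sub_df.iloc[min(len(sub_df), 100) - 1]['score']

hundredth_dict = {id1: get_hundredth(id1) for id1 in set(df['id1'])}

def check_distance(row):
    # Ties at the cutoff are all kept, so a group may end up with more than 100 rows.
    return row['score'] <= hundredth_dict[row['id1']]

closest = df.loc[df.apply(check_distance, axis = 1)]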
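A row-wise apply calls a Python function once per row, which is likely to be very slow at a billion rows. One possible alternative, sketched below on the toy data with n = 1, looks up each row's cutoff with Series.map and compares whole columns at once. The names nth_smallest and cutoffs are made up for this sketch, and note that map still allocates a temporary Series as long as df, so it trades memory for speed.

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["a", "a", "b", "b"],
    "id2": ["a", "b", "b", "c"],
    "score": [1, 2, 3, 4],
})
n = 1  # stand-in for 100

def nth_smallest(scores, n):
    """n-th smallest score, or the largest one if there are fewer than n."""
    ordered = scores.sort_values()
    return ordered.iloc[min(len(ordered), n) - 1]

# One cutoff score per id1.
cutoffs = {
    the_id: nth_smallest(df.loc[df["id1"] == the_id, "score"], n)
    for the_id in df["id1"].unique()
}

# Look up each row's cutoff and compare column-wise instead of row by row.
closest = df.loc[df["score"] <= df["id1"].map(cutoffs)]
print(closest)
```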

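Another strategy is to look at how filtering out distances above a threshold affects the DataFrame. That is, take

   low_scores = df.loc[df['score']<threshold]

for some reasonable threshold. Does this significantly reduce the size of the DataFrame? You need a threshold that makes the DataFrame small enough to work with while still keeping the lowest 100 scores for each id1.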
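One way to check whether a candidate threshold is safe is to count how many rows each id1 keeps after filtering. The sketch below does this on made-up data; the threshold value, n = 2, and the variable names are all assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id1": ["a", "a", "a", "b", "b", "b"],
    "score": [1.0, 2.0, 5.0, 3.0, 4.0, 6.0],
})
n = 2            # stand-in for 100
threshold = 3.5  # hypothetical candidate threshold

low_scores = df.loc[df["score"] < threshold]

# Rows kept per id1; ids that disappear entirely get a count of 0.
kept = low_scores["id1"].value_counts().reindex(df["id1"].unique(), fill_value=0)

# Each id1 needs n rows, or all of its rows if it has fewer than n.
required = df["id1"].value_counts().clip(upper=n).reindex(kept.index)

# ids for which the threshold cut away rows that belong to the top n.
short = kept[kept < required]
print(short)
```

If short is empty, the threshold keeps enough rows for every id1 and the smaller low_scores frame can be used for the top-n step. Here id "b" falls short, so this threshold would be too aggressive.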

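You may also want to look into what optimizations are possible for your particular distance metric. There are probably algorithms designed specifically for cosine similarity, for example.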

Answer 4

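Given the shape (1144468900, 3) and a unique value count of 33,830, the id1 and id2 columns are good candidates for categorical columns: each value repeats about 1144468900/33,830 = 33,830 times. Converting them to a categorical dtype stores each distinct value once plus a small integer code per row, which reduces the memory needed for those two columns. The saving is large when the ids are strings; for int64 ids, as in the dtypes shown in the question, the codes would be int32, so expect roughly half the memory rather than a 33,830-fold reduction. Then perform whatever aggregation you want.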

df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
out = df.groupby(['id1'])['score'].nsmallest(100)
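How much a categorical conversion saves depends on the original dtype, so it may be worth measuring on a sample first. The sketch below builds a synthetic column with 33,830 distinct ids (the row count and variable names are made up) and compares memory_usage(deep=True) for int64, string, and categorical versions.

```python
import numpy as np
import pandas as pd

# Synthetic sample: 33,830 distinct ids, each repeated 10 times.
ids = np.tile(np.arange(33_830), 10)

as_int = pd.Series(ids, dtype="int64")
as_str = as_int.astype(str)
as_cat = as_int.astype("category")

for name, s in [("int64", as_int), ("str", as_str), ("category", as_cat)]:
    # deep=True also counts the Python string objects, where present.
    print(f"{name:>8}: {s.memory_usage(deep=True):>12,} bytes")

# With more than 32,767 categories, pandas needs int32 codes per row.
print(as_cat.cat.codes.dtype)
```

The same measurement on a slice of the real DataFrame would show whether the conversion is worth doing before the groupby.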

Pandas: filter the smallest values by group
