Question

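I have a dataset with 90k rows and 14 columns. I want to merge rows that are duplicates on 13 of the columns by summing the remaining "quantity" column.

Sorry, I can't post the raw data here for you to check. All code below is pseudocode, so please ignore obvious typos.

Problem 1:

When I use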

# gp_cols lists the 13 grouping column names.
# "quantity" is uint32; all other columns were converted to category dtype.

gp_cols = ['dddd-mm', 'province', 'city', 'maker', 'brand', 'model',
           'model-year', 'color', 'liters', 'fueltype', 'type', 'appType', 'domesticimported']

gp = df_test.groupby(by=gp_cols, as_index=True)

# Calling gp.sum(), gp['quantity'].sum() or gp.size() raises the memory error below.
# The same error happens even if I take only the first 10k rows.
MemoryError: Unable to allocate 455. PiB for an array with shape (512584853784231936,) and data type int8

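Why does this happen, and how can I avoid it?

Problem 2:

When I use cumsum() to work around the problem above, the final answer looks correct. But when I compare the totals of 'quantity' and 'cumQty', they are not the same.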
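
A small reproduction with made-up data may help narrow this down. The sketch below assumes the grouping columns are category dtype, as described; the frame `df_demo` and its values are hypothetical. For categorical keys, `groupby` accepts an `observed` argument: with `observed=False` the result covers every combination of category levels, whether or not it occurs in the data, so its size grows with the product of the level counts, while `observed=True` keeps only the combinations that actually appear. Whether this is what triggers the error above is an assumption and has not been confirmed on the original data.

```python
import pandas as pd

# Hypothetical stand-in for df_test: two categorical keys and a quantity column.
df_demo = pd.DataFrame({
    "province": ["A", "A", "B", "B", "C"],
    "city": ["x", "x", "y", "z", "z"],
    "quantity": [1, 2, 3, 4, 5],
})
for col in ["province", "city"]:
    df_demo[col] = df_demo[col].astype("category")

keys = ["province", "city"]

# observed=False: one result row per combination of category levels (3 x 3).
all_combos = df_demo.groupby(keys, observed=False)["quantity"].sum()

# observed=True: one result row per combination that occurs in the data.
seen_only = df_demo.groupby(keys, observed=True)["quantity"].sum()

print(len(all_combos), len(seen_only))
print(seen_only.sum() == df_demo["quantity"].sum())
```

With 13 categorical keys the product of level counts can be enormous even for a few thousand rows, which would be consistent with the size of the requested array.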

df_test.sort_values(by=gp_cols,ignore_index=True, inplace=True)

t5 = df_test.join(gp.cumsum().rename(columns={'quantity':'cumQty'}))
t6 = t5.drop_duplicates(subset=gp_cols, keep='last')

print(t6['cumQty'].sum() )
print(t5['quantity'].sum())
print(df_test['quantity'].sum())
print(t5.shape,t6.shape)

# The result after drop_duplicates is incorrect.
# t5 has the same size as the original data; t6 has the length of the grouper. 633576 rows
# Results on the 5001-row sample below: drop_duplicates appears to lose data.
#37013
#39617
#39617
# (5001, 16) (2642, 16)

# second run
#37199
#39617
#39617
#(5001, 15) (2642, 15)

# third run
#39860
#39617
#39617
#(5001, 15) (2642, 15)

# fourth run
#58515
#39617
#39617
#(5001, 15) (2642, 15)


# Then I tried joining the 13 columns into one new reference column "key"; the groupby result is correct.
t7 = df_test.iloc[:,0].astype('str')
for i in range(1, len(gp_cols)):
    t7 = t7 + "_" + df_test[gp_cols[i]].astype('str')

df_test['key'] = t7

gp7 = df_test.groupby(by="key", as_index=False)

print(gp7.sum()['quantity'].sum())
print(gp7.size())

# The quantity is now correct and matches the sum of the original data.
# Below is the result on the 5001-row sample. No data is lost when grouping by the reference column.
# 39617
# 2642 rows x 2 columns

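Why does the total change after dropping duplicates? My guess is that dropping duplicates removes some of the cumsum rows. Or is the problem in the join?

How can I get the correct answer without creating a reference column and splitting it again afterwards?

Thanks!

Best regards, Kevin

For problem 1, as Joran suggested, I extracted 4999 rows of data to check problem 1 and problem 2. The link may expire after a while.

Even if you take only the first 1000 rows, the code still triggers problem 1, but problem 2 does not appear.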
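
One way to check this without a helper key column is sketched below on made-up data; the names `df_demo`, `keys`, `df_sorted`, `last_rows` and `totals` are hypothetical. It sorts first, computes the per-group running total on the already-sorted frame so the index lines up, and only then keeps the last row of each group. For comparison it also sums directly with `groupby`. The sketch assumes, without confirmation from the original data, that the mismatch comes from `gp` being built before `sort_values(..., ignore_index=True)` renumbered the rows, so that `join` attached the cumulative sums to the wrong rows.

```python
import pandas as pd

# Hypothetical stand-in for df_test, deliberately unsorted.
df_demo = pd.DataFrame({
    "province": ["B", "A", "B", "A", "C", "A"],
    "city": ["y", "x", "y", "x", "z", "w"],
    "quantity": [3, 1, 4, 2, 5, 6],
})
keys = ["province", "city"]

# Sort first, so every later step works on the same row order and index.
df_sorted = df_demo.sort_values(by=keys, ignore_index=True)

# Running total per group, computed on the sorted frame; assigning the
# result back aligns on that frame's own index, so no join is needed.
df_sorted["cumQty"] = df_sorted.groupby(keys, observed=True)["quantity"].cumsum()

# The last row of each group now carries the group total.
last_rows = df_sorted.drop_duplicates(subset=keys, keep="last")

# Direct aggregation, without cumsum or a concatenated key column.
totals = df_demo.groupby(keys, as_index=False, observed=True)["quantity"].sum()

print(last_rows["cumQty"].sum(), totals["quantity"].sum(), df_demo["quantity"].sum())
```

If the three printed totals agree on the real data as well, the direct `groupby(...).sum()` with `observed=True` is probably the simpler route, since it avoids both the cumsum and the drop_duplicates step.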

5001 rows sample data, has problem1 and problem2

python pandas groupby by multi column question on sum() and total add up

0 answers: