Question

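In the dataset below, what is the best way to duplicate rows so that every group with a groupby(['Type']) count < 3 is brought up to 3 rows? df is the input and df1 is my desired result. You can see that row 3 of df is repeated 2 times at the end. This is only a sample set. The actual data has about 20 million rows and 400K unique types, so I need an efficient way to do this.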

>>> df
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
>>> df1
  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1

I am thinking of using something like the following, but I don't know the best way to write the function.

df.groupby('Type').apply(func)

Thanks.

Answer 1

Use value_counts together with map and repeat:

counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(), 
               sort=False, ignore_index=True)[['Type','Val']]

print(df)

  Type  Val
0    a    1
1    a    2
2    a    3
3    b    1
4    c    3
5    c    2
6    c    1
7    b    1
8    b    1
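For readers on newer pandas, where DataFrame.append has been removed (pandas 2.0 and later), the same idea can be sketched with pd.concat and an index repeat. This is only a sketch under stated assumptions: the name min_rows is hypothetical, and the choice to pad each short group with copies of its last row is an assumption, since the question only shows a group of size 1. Note that the snippet above repeats every row of a short group, so a group with 2 rows would end up with 4; the variant below pads each group to exactly min_rows.

```python
import pandas as pd

df = pd.DataFrame({"Type": list("aaabccc"), "Val": [1, 2, 3, 1, 3, 2, 1]})

min_rows = 3  # hypothetical name for the target group size

# Number of rows per Type, then how many rows each Type is short by.
counts = df["Type"].value_counts()
deficit = (min_rows - counts).clip(lower=0)

# One representative row per Type (the last one; an arbitrary assumption).
last_rows = df.drop_duplicates("Type", keep="last")

# Repeat each representative row as many times as its group is short.
repeats = last_rows["Type"].map(deficit).to_numpy()
padding = last_rows.loc[last_rows.index.repeat(repeats)]

# Append the padding rows after the original rows.
df1 = pd.concat([df, padding], ignore_index=True)
print(df1)
```

Because nothing here calls groupby.apply with a Python function, the work stays in vectorized pandas operations, which is what matters for the 20 million row case described in the question.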

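Note: the sort=False argument of append is available in pandas>=0.23.0; remove it if you are using a lower version.

EDIT: If the data contains multiple value columns, make all columns except one the index, repeat, and then call reset_index.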
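The code that originally accompanied the EDIT is not available here, so the following is only a sketch of what the description suggests, not the answerer's own snippet. It assumes a second, hypothetical value column named Val2 and uses pd.concat in place of append so that it also runs on recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "Type": list("aaabccc"),
    "Val": [1, 2, 3, 1, 3, 2, 1],
    "Val2": [10, 20, 30, 40, 50, 60, 70],  # hypothetical extra value column
})

# Per-row repeat count: 3 - group size for short groups, 0 otherwise.
counts = df["Type"].value_counts()
repeat_map = 3 - counts[counts < 3]
repeat_num = df["Type"].map(repeat_map).fillna(0).astype(int)

# Move every column except 'Val' into the index, repeat the remaining
# Series row by row, then turn the index levels back into columns.
repeated = (
    df.set_index(["Type", "Val2"])["Val"]
      .repeat(repeat_num.to_numpy())
      .reset_index()
)

# Stack the repeated rows under the original ones, restoring column order.
out = pd.concat([df, repeated], ignore_index=True)[["Type", "Val", "Val2"]]
print(out)
```

As with the single-column version, every row of a short group is repeated, so this pads groups of size 1 to exactly 3 but would overshoot for groups of size 2.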

Repeating low-occurrence rows in a pandas DataFrame
