Question

I have a DataFrame with several million rows, a unique index, and a column ('b') that contains many repeated values.

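I want to produce a DataFrame without the duplicates, but I don't want to lose the index information. Where 'b' has repeated values, the new index should be the concatenation of the original index labels ("old_index1,old_index2"); for rows where 'b' is unique, the index should stay unchanged. The values of column 'b' should remain untouched, as in a keep='first' strategy. Example below.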

Input DataFrame:

import pandas as pd

df = pd.DataFrame(data = [[1,"non_duplicated_1"],
                          [2,"duplicated"],
                          [2,"duplicated"],
                          [3,"non_duplicated_2"],
                          [4,"non_duplicated_3"]],
                  index=['one','two','three','four','five'],
                  columns=['a','b'])

Desired output:

             a                 b
one          1  non_duplicated_1
two,three    2        duplicated
four         3  non_duplicated_2
five         4  non_duplicated_3

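The actual DataFrame is quite large, so I would like to avoid non-vectorized operations.

I am finding this surprisingly difficult... any ideas?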

Answer 1

Setup

dct = {'index': ','.join, 'a': 'first'}

You can use groupby after reset_index, although it isn't clear to me why you would want to do this:

df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')

                          b  a
index
one        non_duplicated_1  1
two,three        duplicated  2
four       non_duplicated_2  3
five       non_duplicated_3  4
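For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. It assumes the index is unnamed, so reset_index() creates a column called 'index'. The final column selection and the clearing of the index name are additions made here to match the layout of the desired output; they are not part of the one-liner above.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, "non_duplicated_1"],
          [2, "duplicated"],
          [2, "duplicated"],
          [3, "non_duplicated_2"],
          [4, "non_duplicated_3"]],
    index=["one", "two", "three", "four", "five"],
    columns=["a", "b"],
)

# The index is unnamed, so reset_index() stores it in a column named 'index'.
dct = {"index": ",".join, "a": "first"}

result = (
    df.reset_index()
      .groupby("b", as_index=False, sort=False)  # sort=False keeps first-appearance order
      .agg(dct)
      .set_index("index")[["a", "b"]]            # restore the original column order
)
result.index.name = None  # drop the leftover 'index' label
print(result)
```

Note that ','.join is an ordinary Python callable invoked once per group, so this step is not vectorized in the strict sense; with many distinct values in 'b' it may dominate the runtime.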

Answer 2

You can use transform on the index column (after calling reset_index), then drop the duplicates in column b:

df.index = df.reset_index().groupby('b')['index'].transform(','.join)

df.drop_duplicates('b',inplace=True)

>>> df
           a                 b
index                         
one        1  non_duplicated_1
two,three  2        duplicated
four       3  non_duplicated_2
five       4  non_duplicated_3

Pandas reindexing task based on column values
