Question

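I would like to compute descriptive statistics on the data in a Pandas DataFrame, but I only care about the duplicated entries. For example, say I create this DataFrame: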

import pandas as pd
data={'key1':[1,2,3,1,2,3,2,2],'key2':[2,2,1,2,2,4,2,2],'data':[5,6,2,6,1,6,2,8]}
frame=pd.DataFrame(data,columns=['key1','key2','data'])
print(frame)


     key1  key2  data
0     1     2     5
1     2     2     6
2     3     1     2
3     1     2     6
4     2     2     1
5     3     4     6
6     2     2     2
7     2     2     8

As you can see, rows 0, 1, 3, 4, 6 and 7 are all duplicates (on 'key1' and 'key2'). However, if I index this DataFrame like so:

frame[frame.duplicated(['key1','key2'])]

I get

   key1  key2  data
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

(i.e., the first and second rows, 0 and 1, do not show up because the duplicated method did not flag them as True).

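That is my first problem. My second problem concerns how to extract descriptive statistics from this information. Forgetting the missing duplicates for the moment, say I want to compute .min() and .max() for the duplicated entries (so that I can get a range). I can use groupby and call these methods on the groupby object, like so (a is presumably the filtered DataFrame shown above):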

a.groupby(['key1','key2']).min()

which gives

           key1  key2  data
key1 key2                  
1    2        1     2     6
2    2        2     2     1

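The data I want is clearly in here, but what is the best way to extract it? How do I index the resulting object to get what I want (the key1, key2 and data information)?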

Answer 1

EDIT for Pandas 0.17 or later

As of Pandas 0.17, the take_last argument of the duplicated() method was deprecated in favor of the new keep argument. Please refer to this answer for the correct approach:

Call the duplicated() method with keep=False, i.e. frame.duplicated(['key1', 'key2'], keep=False).

Therefore, to extract the data required for this specific question, the following suffices:

In [81]: frame[frame.duplicated(['key1', 'key2'], keep=False)].groupby(['key1', 'key2']).min()
Out[81]: 
           data
key1 key2      
1    2        5
2    2        1

[2 rows x 1 columns]
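Since the question asks for both .min() and .max() so that a range can be derived, the same selection can be extended as in the sketch below. It assumes Pandas 0.17 or later (for keep=False); the variable names dups and stats and the extra 'range' column are illustrative choices, not part of the original answer.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# keep=False flags every member of a duplicated (key1, key2) group,
# including the first occurrence.
dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]

# One row per duplicated key pair, with the min and max of 'data'.
stats = dups.groupby(['key1', 'key2'])['data'].agg(['min', 'max'])

# Illustrative extra column: the spread within each group.
stats['range'] = stats['max'] - stats['min']
print(stats)
```

Passing the keys as a list rather than a tuple avoids the tuple being treated as a single column label in recent Pandas versions.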

Interestingly, this change in Pandas 0.17 may be partly attributable to this question, as mentioned in this issue.

For versions prior to Pandas 0.17:

We can use the take_last argument of the duplicated() method:

take_last: boolean, default False

For a set of distinct duplicate rows, flag all but the last row as duplicated. Default is to flag all but the first row.

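If we set take_last to True, we flag all but the last row of each set of duplicates. Combining this with the default value of False, which flags all but the first row, lets us flag every duplicated row: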

In [76]: frame.duplicated(['key1', 'key2'])
Out[76]: 
0    False
1    False
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool

In [77]: frame.duplicated(['key1', 'key2'], take_last=True)
Out[77]: 
0     True
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool

In [78]: frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])
Out[78]: 
0     True
1     True
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool

In [79]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])]
Out[79]: 
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

[6 rows x 3 columns]
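For readers on Pandas 0.17 or later, the same OR-combination can be written with the keep argument, which takes 'first', 'last' or False. The sketch below is not from the original answer; it only checks that the two masks combined match keep=False on this frame. The variable names are illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])
keys = ['key1', 'key2']

# keep='first' (the default) flags all but the first row of each group;
# keep='last' flags all but the last row (the old take_last=True).
all_but_first = frame.duplicated(keys, keep='first')
all_but_last = frame.duplicated(keys, keep='last')

# OR-ing the two masks flags every row that belongs to a duplicated group.
combined = all_but_first | all_but_last

# keep=False expresses the same thing in a single call.
all_dups = frame.duplicated(keys, keep=False)
print(combined.equals(all_dups))
```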

Now we just need to use the groupby and min methods, and I believe the output is in the required format:

In [81]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])].groupby(('key1', 'key2')).min()
Out[81]: 
           data
key1 key2      
1    2        5
2    2        1

[2 rows x 1 columns]

Answer 2

To get a list of all the duplicated entries with Pandas version 0.17, you can simply set 'keep=False' in the duplicated function.

frame[frame.duplicated(['key1','key2'],keep=False)]

    key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8
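An alternative that none of the answers uses, shown here only as a sketch, is to build the mask from group sizes. It has the side effect of giving the number of rows per key pair, which may be useful as a descriptive statistic in its own right. The name group_size is illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# For every row, the number of rows sharing its (key1, key2) pair.
group_size = frame.groupby(['key1', 'key2'])['data'].transform('size')

# Rows whose key pair occurs more than once are the duplicated entries.
dups = frame[group_size > 1]
print(dups)
```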

Answer 3

Here is one possible solution to return all the duplicated values in the two columns (i.e. rows 0, 1, 3, 4, 6, 7):

>>> key1_dups = frame.key1[frame.key1.duplicated()].values
>>> key2_dups = frame.key2[frame.key2.duplicated()].values
>>> frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

(Edit: actually, the df.duplicated(take_last=True) | df.duplicated() method in @Yoel's answer is neater.)
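One caveat about the per-column isin approach: it checks each column independently, so it is not in general the same as checking for repeated (key1, key2) pairs. It matches on the frame from the question, but the hypothetical frame below (invented for illustration, not from the question) shows a case where the two differ. The pairwise check assumes Pandas 0.17 or later.

```python
import pandas as pd

# Hypothetical data: each column repeats values, yet no (key1, key2) pair repeats.
df = pd.DataFrame({'key1': [1, 1, 2, 2],
                   'key2': [1, 2, 1, 2]})

# Per-column approach: values that are repeated within each column separately.
key1_dups = df.key1[df.key1.duplicated()].values
key2_dups = df.key2[df.key2.duplicated()].values
per_column = df[df.key1.isin(key1_dups) & df.key2.isin(key2_dups)]

# Pairwise approach: rows whose (key1, key2) combination is repeated.
pairwise = df[df.duplicated(['key1', 'key2'], keep=False)]

print(len(per_column), len(pairwise))
```

Here the per-column mask selects all four rows, while the pairwise check selects none.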

To query the result of the groupby operation, you can use loc. For example:

>>> dups = frame[frame.key1.isin(key1_dups) & frame.key2.isin(key2_dups)]
>>> grouped = dups.groupby(['key1','key2']).min()
>>> grouped
           data
key1 key2      
1    2        5
2    2        1

>>> grouped.loc[1, 2]
    data    5
Name: (1, 2), dtype: int64
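A few other ways of indexing the grouped result are sketched below. This assumes dups is built with keep=False (Pandas 0.17 or later) rather than the isin masks above, and the names min_for_pair, for_key1 and for_key2 are illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]
grouped = dups.groupby(['key1', 'key2']).min()

# A full (key1, key2) tuple plus a column label returns a single value.
min_for_pair = grouped.loc[(2, 2), 'data']

# A partial key on the first level returns the rows for that key1,
# indexed by the remaining level (key2).
for_key1 = grouped.loc[1]

# xs selects on a named level, here every group with key2 == 2.
for_key2 = grouped.xs(2, level='key2')

print(min_for_pair)
print(for_key1)
print(for_key2)
```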

Alternatively, turn grouped back into a "normal-looking" DataFrame by resetting both index levels:

>>> grouped.reset_index(level=0).reset_index(level=0)
   key2  key1  data
0     2     1     5
1     2     2     1
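As a variation on the above, calling reset_index() with no arguments resets all index levels at once and keeps key1 before key2, and passing as_index=False to groupby avoids building the MultiIndex in the first place. The sketch below assumes Pandas 0.17 or later for keep=False; the names flat and flat_direct are illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])
dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]

# Option 1: group as usual, then move every index level back into columns.
grouped = dups.groupby(['key1', 'key2']).min()
flat = grouped.reset_index()

# Option 2: ask groupby to keep the keys as ordinary columns.
flat_direct = dups.groupby(['key1', 'key2'], as_index=False).min()

print(flat)
print(flat_direct)
```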

How do I analyze all the duplicate entries in this Pandas DataFrame?
