Question

I am trying to count the duplicates of each type of row in my dataframe. For example, say I have a dataframe in pandas as follows:

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

My df looks like this:

    one two
0   1   1
1   1   2
2   1   1

I figured the first step would be to find all the distinct unique rows, which I do with:

df.drop_duplicates()

This gives me the following df:

    one two
0   1   1
1   1   2

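Now I want to take each row from the df above ([1 1] and [1 2]) and count how many times each one occurs in the initial df. My result would look something like this: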

Row     Count
[1 1]     2
[1 2]     1

How should I do this last step?

Edit:

Here is a more explicit example:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

gives me:

    one three   two
0   True    True    True
1   True    False   False
2   True    False   False
3   False   False   True

I want a result that tells me:

       Row           Count
[True True True]       1
[True False False]     2
[False False True]     1

Answer 1

You can groupby on all the columns and call size. The index of the result indicates the duplicated values:

In [28]:
df.groupby(df.columns.tolist()).size()

Out[28]:
one    three  two  
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
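As a minimal sketch of the same idea on the boolean example from the question, the grouped sizes can be turned into a plain dictionary keyed by row. The column order is fixed explicitly here, and row_counts is just a name chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "one": [True, True, True, False],
    "two": [True, False, False, True],
    "three": [True, False, False, False],
})

# Group on every column; size() counts the rows that fall into each group.
counts = df.groupby(df.columns.tolist()).size()

# The result is a Series indexed by the distinct rows, so converting it
# to a dict gives {row_as_tuple: count}, e.g. (True, False, False) maps to 2.
row_counts = counts.to_dict()
print(row_counts)
```

The tuple keys follow the column order (one, two, three) used when the frame was built.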

Answer 2

df.groupby(df.columns.tolist()).size().reset_index().\
    rename(columns={0:'records'})

   one  two  records
0    1    1        2
1    1    2        1
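A slightly shorter variant, as a sketch: Series.reset_index accepts a name argument, so the count column can be named in one step instead of renaming column 0 afterwards. The frame below is the first example from the question:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 1.0, 1.0], "two": [1.0, 2.0, 1.0]})

# reset_index(name=...) names the count column directly,
# so no separate rename(columns={0: ...}) step is needed.
records = (
    df.groupby(df.columns.tolist())
      .size()
      .reset_index(name="records")
)
print(records)
```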

Answer 3

If you want to count duplicates on a particular column:

len(df['one'])-len(df['one'].drop_duplicates())

If you want to count duplicates on the entire dataframe:

len(df)-len(df.drop_duplicates())

Or you can simply use DataFrame.duplicated(subset=None, keep='first'):

df.duplicated(subset='one', keep='first').sum()

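where

subset : column label or sequence of labels (by default, all of the columns are used)

keep : {'first', 'last', False}, default 'first'

first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.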
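To make the three keep options concrete, here is a small sketch on the first example frame from the question. The variable names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 1, 1], "two": [1, 2, 1]})

# Rows 0 and 2 are identical; row 1 is unique.
first = df.duplicated(keep="first").tolist()   # [False, False, True]
last = df.duplicated(keep="last").tolist()     # [True, False, False]
every = df.duplicated(keep=False).tolist()     # [True, False, True]

# Counting: repeats within one column versus repeats of whole rows.
dup_in_one = int(df.duplicated(subset="one").sum())   # 2
dup_rows = len(df) - len(df.drop_duplicates())        # 1

print(first, last, every, dup_in_one, dup_rows)
```

Note that with keep='first' or keep='last' the sum counts only the surplus copies, while keep=False counts every row that has at least one twin.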

Answer 4

df = pd.DataFrame({'one' : pd.Series([1., 1, 1, 3]), 'two' : pd.Series([1., 2., 1, 3] ), 'three' : pd.Series([1., 2., 1, 2] )})
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = pd.DataFrame(df['str_list'].value_counts().values, index=df['str_list'].value_counts().index, columns=['Count'])

Yields:

>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1

If the index values must be lists, you could take the above code a step further with:

df1.index = df1.index.str.split()

Yields:

           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1
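If going through strings feels fragile (the int() cast above assumes whole-number values), one alternative sketch is to count the rows as tuples with collections.Counter. This is a swapped-in technique, not the approach of the answer above, and row_counts is a hypothetical name:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "one": [1, 1, 1, 3],
    "two": [1, 2, 1, 3],
    "three": [1, 2, 1, 2],
})

# itertuples(index=False, name=None) yields each row as a plain tuple,
# which is hashable and can therefore be counted directly.
row_counts = Counter(df.itertuples(index=False, name=None))

for row, count in row_counts.most_common():
    print(list(row), count)
```

The tuples follow the column order of the frame as constructed here (one, two, three), which can differ from the order in the output shown above.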

Answer 5

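None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does: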

# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})

# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
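As a quick sanity check (a sketch on a made-up four-row frame), the number computed this way should agree with df.duplicated().sum(), since both count every copy of a row beyond its first occurrence:

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 1, 1, 3],
    "two": [1, 2, 1, 3],
    "three": [1, 2, 1, 2],
})

# One row per distinct combination, with the number of times it occurs.
dups = df.groupby(df.columns.tolist()).size().reset_index(name="count")

# Total rows minus distinct rows = rows that would be dropped as duplicates.
surplus_from_groupby = int(dups["count"].sum() - dups.shape[0])

# The same quantity via duplicated(), which flags all but the first occurrence.
surplus_from_duplicated = int(df.duplicated().sum())

print(surplus_from_groupby, surplus_from_duplicated)
```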

Answer 6

I use:

used_features =[
    "one",
    "two",
    "three"
]

df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()

which gives the count of duplicated rows, and then you can analyse them through the new column. I didn't see such a solution here.
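A short sketch of what that analysis might look like, on a made-up frame. One thing to watch: once is_duplicated has been added as a column, later calls to duplicated should keep passing used_features, otherwise the flag column itself takes part in the comparison:

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 1, 1, 3],
    "two": [1, 2, 1, 3],
    "three": [1, 2, 1, 2],
})
used_features = ["one", "two", "three"]

# Flag every row that repeats an earlier row on the chosen columns.
df["is_duplicated"] = df.duplicated(used_features)
n_duplicates = int(df["is_duplicated"].sum())

# Only the surplus copies.
repeats = df[df["is_duplicated"]]

# Every row that belongs to a duplicated group, first occurrences included.
groups = df[df.duplicated(used_features, keep=False)]

print(n_duplicates)
print(repeats)
print(groups)
```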

Answer 7

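I ran into this problem today and wanted to include NaNs, so I temporarily replace them with "" (empty string). Please comment if you don't understand something :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it successfully, but not extensively), since pandas will infer the data type again once "" has been replaced back with np.nan.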

import numpy as np
import pandas as pd

# create test data
df = pd.DataFrame({'test':['foo','bar',None,None,'foo'],
                  'test2':['bar',None,None,None,'bar'],
                  'test3':[None, 'foo','bar',None,None]})

# fill null values with '' to not lose them during groupby
# groupby all columns and calculate the length of the resulting groups
# rename the series obtained with groupby to "group_count"
# reset the index to get a DataFrame
# replace '' with np.nan (this reverts our first operation)
# sort DataFrame by "group_count" descending
df = (df.fillna('')
      .groupby(df.columns.tolist()).apply(len)
      .rename('group_count')
      .reset_index()
      .replace('', np.nan)
      .sort_values(by=['group_count'], ascending=False))
df

  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1
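On newer pandas versions the placeholder round trip may not be needed. As a sketch, assuming pandas 1.1 or later, where groupby accepts a dropna parameter, the missing values can be kept as group keys directly:

```python
import pandas as pd

df = pd.DataFrame({
    "test": ["foo", "bar", None, None, "foo"],
    "test2": ["bar", None, None, None, "bar"],
    "test3": [None, "foo", "bar", None, None],
})

# dropna=False keeps groups whose keys contain missing values
# (assumption: pandas >= 1.1, where this parameter was introduced).
counts = (
    df.groupby(df.columns.tolist(), dropna=False)
      .size()
      .reset_index(name="group_count")
      .sort_values(by="group_count", ascending=False)
)
print(counts)
```

This avoids having to pick a placeholder value that is guaranteed not to occur in the data.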

Answer 8

In pandas 1.1.0 and later, you can use the value_counts method:

df = pd.DataFrame({'A': [1, 1, 1], 'B': [1, 2, 1]})
df.value_counts()

Output:

A  B
1  1    2
   2    1
dtype: int64

Or, to get the counts as a regular column:

df.value_counts().reset_index(name='counts')
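As a small follow-up sketch, the flattened result can be filtered to keep only the rows that actually occur more than once. The name repeated is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1], "B": [1, 2, 1]})

# One row per distinct (A, B) combination, most frequent first.
counts = df.value_counts().reset_index(name="counts")

# Keep only the combinations that appear more than once.
repeated = counts[counts["counts"] > 1]

print(counts)
print(repeated)
```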
