Get duplicate row counts in Pandas while keeping the original index

Time: 2016-12-16 09:48:59

Tags: python pandas group-by aggregate multiple-columns

I need to find duplicate rows in a Pandas DataFrame and then add an extra column with the count. Suppose we have this DataFrame:

>>> print(df)

+----+-----+-----+-----+-----+-----+-----+-----+-----+
|    |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|
|  0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  1 |   2 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  2 |   2 |   4 |   3 |   4 |   1 |   1 |   4 |   4 |
|  3 |   4 |   3 |   4 |   0 |   0 |   0 |   0 |   0 |
|  4 |   2 |   3 |   4 |   3 |   4 |   0 |   0 |   0 |
|  5 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  6 |   4 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |
|  7 |   1 |   1 |   4 |   0 |   0 |   0 |   0 |   0 |
|  8 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
|  9 |   4 |   3 |   4 |   0 |   0 |   0 |   0 |   0 |
| 10 |   3 |   3 |   4 |   3 |   5 |   5 |   5 |   0 |
| 11 |   5 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |
| 12 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 13 |   0 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |
| 14 |   2 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 15 |   1 |   3 |   5 |   0 |   0 |   0 |   0 |   0 |
| 16 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
| 17 |   3 |   3 |   4 |   4 |   0 |   0 |   0 |   0 |
| 18 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+

The frame above should become the one below, with an additional column holding the count. Note that the original index labels are preserved.

+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    |   2 |   3 |   4 |   5 |   6 |   7 |   8 |   9 |  10 |
|----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
|  0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   2 |
|  1 |   2 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   2 |
|  2 |   2 |   4 |   3 |   4 |   1 |   1 |   4 |   4 |   1 |
|  3 |   4 |   3 |   4 |   0 |   0 |   0 |   0 |   0 |   2 |
|  4 |   2 |   3 |   4 |   3 |   4 |   0 |   0 |   0 |   1 |
|  5 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   3 |
|  6 |   4 |   5 |   0 |   0 |   0 |   0 |   0 |   0 |   1 |
|  7 |   1 |   1 |   4 |   0 |   0 |   0 |   0 |   0 |   1 |
| 10 |   3 |   3 |   4 |   3 |   5 |   5 |   5 |   0 |   1 |
| 11 |   5 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |   1 |
| 13 |   0 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |   1 |
| 15 |   1 |   3 |   5 |   0 |   0 |   0 |   0 |   0 |   1 |
| 16 |   4 |   0 |   0 |   0 |   0 |   0 |   0 |   0 |   1 |
| 17 |   3 |   3 |   4 |   4 |   0 |   0 |   0 |   0 |   1 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

I have seen other solutions, such as:

 df.groupby(list(df.columns.values)).size()

But this returns a result with gaps and without the original index.
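To make the problem concrete, here is a minimal sketch on a hypothetical two-column frame (the data and the name `counts` are made up for illustration). It shows that the result of `groupby(...).size()` is keyed by the column values rather than by the original row labels:

```python
import pandas as pd

# Hypothetical miniature frame; rows 0 and 2 are duplicates.
df = pd.DataFrame(
    [[0, 0], [2, 0], [0, 0], [4, 3]],
    columns=["2", "3"],
)

counts = df.groupby(list(df.columns)).size()
print(counts)
# The result is a Series whose index is built from the grouped column
# values (a MultiIndex here), so the original row labels 0..3 are lost.
```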

1 answer:

Answer 0: (score: 4)

You can first use reset_index to convert the index to a column, then group by the remaining columns and aggregate the index column with first and len (size in the code below):

Also, if you need to group by all columns, use difference to exclude the index column:

print (df.columns.difference(['index']))
Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')

print (df.reset_index()
         .groupby(df.columns.difference(['index']).tolist())['index']
         .agg(['first', 'size'])
         .reset_index()
         .set_index(['first'])
         .sort_index()
         .rename_axis(None))

    2  3  4  5  6  7  8  9  size
0   0  0  0  0  0  0  0  0     2
1   2  0  0  0  0  0  0  0     2
2   2  4  3  4  1  1  4  4     1
3   4  3  4  0  0  0  0  0     2
4   2  3  4  3  4  0  0  0     1
5   5  0  0  0  0  0  0  0     3
6   4  5  0  0  0  0  0  0     1
7   1  1  4  0  0  0  0  0     1
10  3  3  4  3  5  5  5  0     1
11  5  4  0  0  0  0  0  0     1
13  0  4  0  0  0  0  0  0     1
15  1  3  5  0  0  0  0  0     1
16  4  0  0  0  0  0  0  0     1
17  3  3  4  4  0  0  0  0     1
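The chain above depends on a `df` that is not defined in the answer. As a self-contained sketch, the same steps can be run on a small hypothetical frame (the sample data and the names `group_cols` and `result` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample data; rows 0/2 and rows 1/4 are duplicates.
df = pd.DataFrame(
    [[0, 0, 0], [2, 0, 0], [0, 0, 0], [4, 3, 4], [2, 0, 0], [5, 0, 0]],
    columns=["2", "3", "4"],
)

group_cols = df.columns.tolist()
result = (
    df.reset_index()                 # original index becomes a column named 'index'
      .groupby(group_cols)["index"]
      .agg(["first", "size"])        # first original label and number of rows per group
      .reset_index()                 # group keys become ordinary columns again
      .set_index("first")            # use the first original label as the index
      .sort_index()
      .rename_axis(None)             # drop the index name 'first'
)
print(result)
```

Each group keeps the label of its first occurrence, so the resulting index has gaps wherever a later duplicate was folded into an earlier row.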

If necessary, name the new column 10 (the next column number) with rename:

# convert to str if necessary (the existing column labels are strings)
last_col = str(df.columns.astype(int).max() + 1)
print (last_col)
10

print (df.reset_index()
        .groupby(df.columns.difference(['index']).tolist())['index']
        .agg(['first', 'size'])
        .reset_index()
        .set_index(['first'])
        .sort_index()
        .rename_axis(None)
        .rename(columns={'size':last_col}))

    2  3  4  5  6  7  8  9  10
0   0  0  0  0  0  0  0  0   2
1   2  0  0  0  0  0  0  0   2
2   2  4  3  4  1  1  4  4   1
3   4  3  4  0  0  0  0  0   2
4   2  3  4  3  4  0  0  0   1
5   5  0  0  0  0  0  0  0   3
6   4  5  0  0  0  0  0  0   1
7   1  1  4  0  0  0  0  0   1
10  3  3  4  3  5  5  5  0   1
11  5  4  0  0  0  0  0  0   1
13  0  4  0  0  0  0  0  0   1
15  1  3  5  0  0  0  0  0   1
16  4  0  0  0  0  0  0  0   1
17  3  3  4  4  0  0  0  0   1
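For comparison, a different technique from the one in the answer is to broadcast the group size back to every row with `transform('size')` and then drop the later duplicates. This is only a sketch on hypothetical data; the column name `count` and the variable names are assumptions. It also assumes there are no missing values in the grouping columns, because groupby excludes NaN keys by default.

```python
import pandas as pd

# Hypothetical sample data; rows 0/2 and rows 1/4 are duplicates.
df = pd.DataFrame(
    [[0, 0, 0], [2, 0, 0], [0, 0, 0], [4, 3, 4], [2, 0, 0], [5, 0, 0]],
    columns=["2", "3", "4"],
)

cols = df.columns.tolist()
out = df.copy()
# Size of each group of identical rows, aligned to the original index.
out["count"] = df.groupby(cols)[cols[0]].transform("size")
# Keep only the first occurrence of each row; its original label is preserved.
out = out.drop_duplicates(subset=cols, keep="first")
print(out)
```

Because nothing is moved into or out of the index, the original labels survive without the reset_index and set_index round trip.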