我需要在Pandas Dataframe中找到重复的行,然后添加一个带有count的额外列。假设我们有一个数据框:
>>print(df)
+----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 |
| 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+
上面的框架将成为下面的框架,并附加一个带有计数的列。您可以看到我们仍然保留索引列。
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 1 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 1 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 1 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
我已经看到了其他解决方案,例如:
df.groupby(list(df.columns.values)).size()
但是这会返回一个带有间隙且没有初始索引的矩阵。
答案 0 :(得分:4)
您可以先使用reset_index
将 ALTER PROCEDURE [dbo].spGetVendorbyFilter
@PageNumber INT,
@PageSize INT,
@city VARCHAR(200),
@area VARCHAR(200),
@vendortype VARCHAR(200)
AS
BEGIN
DECLARE @StartRow INT
DECLARE @EndRow INT
SET @StartRow = ( ( @PageNumber - 1 ) * @PageSize ) + 1;
SET @EndRow= @PageNumber * @PageSize;
WITH Result
AS (
SELECT *,
Row_number()
OVER (
ORDER BY VendorID ASC) RowNumber
FROM tblVendor
)
IF (@city IS NOT NULL AND @area IS NULL AND @vendortype IS NULL)
SELECT *
FROM Result where City=@city AND RowNumber BETWEEN @StartRow and @EndRow
ELSE IF (@city IS NULL AND @area IS NOT NULL AND @vendortype IS NULL)
SELECT *
FROM Result where Area=@area AND RowNumber BETWEEN @StartRow and @EndRow
ELSE IF (@city IS NULL AND @area IS NULL AND @vendortype IS NOT NULL)
SELECT *
FROM Result where Category=@vendortype AND RowNumber BETWEEN @StartRow and @EndRow
ELSE IF (@city IS NOT NULL AND @area IS NOT NULL AND @vendortype IS NULL)
SELECT *
FROM Result where City=@city And Area=@area AND RowNumber BETWEEN @StartRow and @EndRow
ELSE IF (@city IS NOT NULL AND @area IS NULL AND @vendortype IS NOT NULL)
SELECT *
FROM Result where City=@city And Category=@vendortype AND RowNumber BETWEEN @StartRow and @EndRow
ELSE IF (@city IS NULL AND @area IS NOT NULL AND @vendortype IS NOT NULL)
SELECT *
FROM Result where Area=@area And Category=@vendortype AND RowNumber BETWEEN @StartRow and @EndRow
ELSE
SELECT *
FROM Result WHERE RowNumber BETWEEN @StartRow and @EndRow
END
转换为列,然后index
first
和len
使用aggregate
:
此外,如果需要按所有列分组,请按difference
删除index
列:
print (df.columns.difference(['index']))
Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')
print (df.reset_index()
.groupby(df.columns.difference(['index']).tolist())['index']
.agg(['first', 'size'])
.reset_index()
.set_index(['first'])
.sort_index()
.rename_axis(None))
2 3 4 5 6 7 8 9 size
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1
如有必要,请添加下一栏10
需要rename
:
#if necessary convert to str
last_col = str(df.columns.astype(int).max() + 1)
print (last_col)
10
print (df.reset_index()
.groupby(df.columns.difference(['index']).tolist())['index']
.agg(['first', 'size'])
.reset_index()
.set_index(['first'])
.sort_index()
.rename_axis(None)
.rename(columns={'size':last_col}))
2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1