How to compare rows of a DataFrame for equality in Python

Asked: 2016-05-12 10:04:46

Tags: python pandas dataframe

1 0 0 0 1
0 0 0 0 0
0 1 0 0 1
1 0 0 0 1
0 0 0 0 0
1 0 0 0 1

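I have a DataFrame (shown above). I need to compare its rows to find the ones that match. For the df above, the comparison should give row1 = row4 = row6 and row2 = row5. Is there an efficient way to do this kind of row comparison in Python?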

2 answers:

Answer 0 (score: 3)

Use:

import pandas as pd


df = pd.DataFrame({0: {0: 1, 1: 0, 2: 0, 3: 1, 4: 0, 5: 1}, 
                   1: {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}, 
                   2: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                   3: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                   4: {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1}})
print(df)
   0  1  2  3  4
0  1  0  0  0  1
1  0  0  0  0  0
2  0  1  0  0  1
3  1  0  0  0  1
4  0  0  0  0  0
5  1  0  0  0  1
# first, select every row that has at least one duplicate (keep=False marks all copies)
df1 = df[df.duplicated(keep=False)]
print(df1)
   0  1  2  3  4
0  1  0  0  0  1
1  0  0  0  0  0
3  1  0  0  0  1
4  0  0  0  0  0
5  1  0  0  0  1

# sort by all columns so that identical rows become adjacent
df2 = df1.sort_values(by=df.columns.tolist())
print(df2)
   0  1  2  3  4
1  0  0  0  0  0
4  0  0  0  0  0
0  1  0  0  0  1
3  1  0  0  0  1
5  1  0  0  0  1

# label the groups: start a new group whenever a row differs from the row above it
print((~((df2 == df2.shift(1)).all(axis=1))).cumsum())
1    1
4    1
0    2
3    2
5    2
dtype: int32
# print the groups (row 2 has no group label, so groupby drops it)
for i, g in df.groupby((~((df2 == df2.shift(1)).all(axis=1))).cumsum()):
    print(g)

   0  1  2  3  4
1  0  0  0  0  0
4  0  0  0  0  0
   0  1  2  3  4
0  1  0  0  0  1
3  1  0  0  0  1
5  1  0  0  0  1

# dict comprehension for storing the groups, with keys starting at 0
dfs = {i-1: g for i, g in df.groupby((~((df2 == df2.shift(1)).all(axis=1))).cumsum())}
print(dfs)
{0.0:    0  1  2  3  4
1  0  0  0  0  0
4  0  0  0  0  0, 1.0:    0  1  2  3  4
0  1  0  0  0  1
3  1  0  0  0  1
5  1  0  0  0  1}

print(dfs[0])
   0  1  2  3  4
1  0  0  0  0  0
4  0  0  0  0  0

print(dfs[1])
   0  1  2  3  4
0  1  0  0  0  1
3  1  0  0  0  1
5  1  0  0  0  1
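
If only the index labels of the matching rows are needed, the same result can probably be reached more directly by grouping on every column. The following is a minimal sketch, not part of the original answer; the helper name matching_row_groups is hypothetical, and the sample frame is rebuilt from the question's data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [[1, 0, 0, 0, 1],
     [0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1],
     [1, 0, 0, 0, 1],
     [0, 0, 0, 0, 0],
     [1, 0, 0, 0, 1]]
)


def matching_row_groups(frame):
    """Return lists of index labels whose rows are identical in every column.

    Hypothetical helper for illustration only.
    """
    cols = frame.columns.tolist()
    # .groups maps each distinct row (as a tuple of values) to the index
    # labels of the rows that hold those values.
    groups = frame.groupby(cols).groups
    # Keep only rows that occur more than once.
    return [labels.tolist() for labels in groups.values() if len(labels) > 1]


matches = matching_row_groups(df)
print(matches)
```

The labels are 0-based, so [0, 3, 5] corresponds to row1 = row4 = row6 in the question's numbering.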

Answer 1 (score: 1)

Here is what I came up with.

import pandas as pd


df = pd.DataFrame({0: {0: 1, 1: 0, 2: 0, 3: 1, 4: 0, 5: 1}, 
                   1: {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}, 
                   2: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                   3: {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}, 
                   4: {0: 1, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1}})

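# group identical rows together by grouping on every column
groups = df.groupby(df.columns.tolist())
df.loc[:, 'group_num'] = None


# each item is a (group_name, group_table) tuple; number the groups in order
for num, group in enumerate(groups):
    df.loc[group[1].index, 'group_num'] = num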

...which yields:

   0  1  2  3  4 group_num
0  1  0  0  0  1         2
1  0  0  0  0  0         0
2  0  1  0  0  1         1
3  1  0  0  0  1         2
4  0  0  0  0  0         0
5  1  0  0  0  1         2
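
The explicit loop can likely be replaced by GroupBy.ngroup(), which numbers each group directly. This is a sketch of an alternative, not the answer's own code, and it assumes a pandas version that provides ngroup (0.20.2 or later).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [[1, 0, 0, 0, 1],
     [0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1],
     [1, 0, 0, 0, 1],
     [0, 0, 0, 0, 0],
     [1, 0, 0, 0, 1]]
)

# Capture the data columns before adding the new one.
cols = df.columns.tolist()

# ngroup() assigns one integer per distinct row; with the default sort=True
# the numbers follow the sorted order of the group keys.
df["group_num"] = df.groupby(cols).ngroup()
print(df)
```

Rows sharing a group_num are identical, so this gives the same labelling as the loop while leaving the column with an integer dtype instead of object.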

Why group[1] in the last line?

Because you are iterating over tuples of the form (group_name, group_table). group[1] accesses the actual grouped DataFrame.
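
To make the tuple structure concrete, here is a small sketch (not from the original thread) that unpacks each item while iterating; the variable names key, rows and pairs are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [[1, 0, 0, 0, 1],
     [0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1],
     [1, 0, 0, 0, 1],
     [0, 0, 0, 0, 0],
     [1, 0, 0, 0, 1]]
)

pairs = []
for item in df.groupby(df.columns.tolist()):
    # Each item is a 2-tuple: item[0] is the group key (the shared row values),
    # item[1] is the sub-DataFrame containing the rows of that group.
    key, rows = item
    pairs.append((key, rows.index.tolist()))

for key, labels in pairs:
    print(key, labels)
```

Unpacking as "for key, rows in ..." is equivalent to indexing with group[0] and group[1], and tends to read more clearly.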