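I need to count how many times each complete row is repeated in my data frame, then show only the rows that are repeated, with the number of repetitions in the last column.
Input values for producing the correct output table: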
dur,wage1,wage2,wage3,cola,hours,pension,stby_pay,shift_diff,educ_allw,holidays,vacation,ldisab,dntl,ber,hplan,agr
2,4.5,4.0,?,?,40,?,?,2,no,10,below average,no,half,?,half,bad
2,2.0,2.0,?,none,40,none,?,?,no,11,average,yes,none,yes,full,bad
3,4.0,5.0,5.0,tc,?,empl_contr,?,?,?,12,generous,yes,none,yes,half,good
1,2.0,?,?,tc,40,ret_allw,4,0,no,11,generous,no,none,no,none,bad
1,6.0,?,?,?,38,?,8,3,?,9,generous,?,?,?,?,good
2,2.5,3.0,?,tcf,40,none,?,?,?,11,below average,?,?,yes,?,bad
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
1,2.1,?,?,tc,40,ret_allw,2,3,no,9,below average,yes,half,?,none,bad
1,2.8,?,?,none,38,empl_contr,2,3,no,9,below average,yes,half,?,none,bad
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.3,4.4,?,?,38,?,?,4,?,12,generous,?,full,?,full,good
1,2.8,?,?,?,35,?,?,2,?,12,below average,?,?,?,?,good
2,2.0,2.5,?,?,35,?,?,6,yes,12,average,?,?,?,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.5,4.0,?,none,40,?,?,4,?,12,average,yes,full,yes,half,good
3,3.5,4.0,4.6,none,36,?,?,3,?,13,generous,?,?,yes,full,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
I have to keep only the rows that are exactly identical in every column.
This is the resulting table:
    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff  num_reps
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN         4
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0         2
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0         3
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN         2
As you can see in this table, we keep, for example, the row with index 6, because rows 6 and 17 of the input table are identical.
With my current code:
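    from pandas import DataFrame

    def detect_duplicates(data):
        # Empty frame with the extra "num_reps" column (overwritten on the next line)
        x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
        # Rows that occur more than once, reduced to the first row of each group
        x = data[data.duplicated(keep=False)].drop_duplicates()
        return x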
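I get the right rows, but I don't know how to count the repeated rows and add that count as the column 'num_reps' at the end of the table.
This is my result, without the last column that counts the repeated rows: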
    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
How can I count the rows correctly, based on equality across all columns, and then add and display that count in the column 'num_reps'?
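One possible way to get the count is sketched below. This is only an illustration under some assumptions, not a definitive answer. It assumes the file is read with `?` treated as missing, and that rows with missing values in the same positions should count as identical, which is how `duplicated` already behaves. The function name `detect_duplicates_with_counts`, the variable `sample_csv`, and the string-key trick are my own choices for the example. The sample is six rows taken from the input above.

```python
import io

import pandas as pd

# A few rows from the input above; "?" marks a missing value.
sample_csv = """dur,wage1,wage2,wage3,cola,hours,pension,stby_pay,shift_diff,educ_allw,holidays,vacation,ldisab,dntl,ber,hplan,agr
2,4.5,4.0,?,?,40,?,?,2,no,10,below average,no,half,?,half,bad
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
1,2.8,?,?,none,38,empl_contr,2,3,no,9,below average,yes,half,?,none,bad
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
"""


def detect_duplicates_with_counts(data: pd.DataFrame) -> pd.DataFrame:
    """Return one row per repeated group, with its count in 'num_reps'.

    Hypothetical helper for illustration only.
    """
    # Build one string key per row. NaN becomes the text "nan", so rows
    # with missing values in the same positions share the same key.
    # "\x1f" (unit separator) is assumed not to occur in the data.
    key = data.astype(str).apply("\x1f".join, axis=1)
    # Look up, for every row, how often its key occurs in the whole frame.
    counts = key.map(key.value_counts())
    out = data.assign(num_reps=counts)
    # Keep rows seen more than once, then only the first row of each group.
    return out[out["num_reps"] > 1].drop_duplicates()


sample = pd.read_csv(io.StringIO(sample_csv), na_values="?")
result = detect_duplicates_with_counts(sample)
print(result[["dur", "wage1", "cola", "pension", "num_reps"]])
```

Going through a string key avoids the problem that NaN values are dropped or never compare equal in some grouping operations. The trade-off is an extra string copy of the frame, which may matter for very large inputs.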