I have the following DataFrame:
    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
13  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
17  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
31  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
44  1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
47  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
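What I want to do is count the rows that are exactly identical, including their NaN values.
The problem is that I am using groupby, which ignores NaN values: rows containing NaN are left out of the count, so I do not get the correct number of exact duplicates for these rows.
My code is as follows: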
from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # This line should count my repeated rows
    s = aux.groupby(data.columns.tolist(), as_index=False).transform('size')
    return x
If I print the "x" variable, I get this result, which shows all the duplicated rows:
    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
13  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
17  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
31  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
44  1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
47  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
51  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
53  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
Now I need to count, within this result, the rows that are exactly identical.
This should be my correct output:
    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff  num_reps
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN         4
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0         2
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0         3
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN         2
That is my problem: groupby ignores NaN values, which is why other similar posts about this issue have not helped me.
Thanks.
Answer 0 (score: 0)
If your DataFrame is named df, a single line of code counts the duplicated rows:
sum(df.duplicated(keep = False))
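To remove duplicate rows instead, use the drop_duplicates method (documentation).
Example:
# data.csv (the "#duplicate" notes are annotations, not part of the file)
col1,col2,col3
a,3,NaN #duplicate
b,9,4 #duplicate
c,12,5
a,3,NaN #duplicate
b,9,4 #duplicate
d,19,20
a,3,NaN #duplicate - 5 duplicate rows
Import data.csv and drop the duplicate rows (by default, the first instance of each duplicated row is kept):
import pandas as pd
df = pd.read_csv("data.csv")
print(df.drop_duplicates())
#Output
  col1  col2  col3
0    a     3   NaN
1    b     9   4.0
2    c    12   5.0
5    d    19  20.0
To count the duplicate rows, use the DataFrame's duplicated method with "keep" set to False (documentation). As mentioned above, a single line does this, and it is also a simple way to see what the duplicated method does:
sum(df.duplicated(keep = False))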
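To make the one-liner less opaque, here is a small sketch of what the boolean mask looks like before it is summed. It assumes the same sample rows as data.csv, read from an in-memory string rather than from disk; the names csv_text, mask and n_dupes are only illustrative.

```python
import io

import pandas as pd

# Same rows as the sample data.csv, without the "#duplicate" annotations.
csv_text = """col1,col2,col3
a,3,NaN
b,9,4
c,12,5
a,3,NaN
b,9,4
d,19,20
a,3,NaN
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep=False flags every member of a duplicated group, not only the repeats.
# Rows whose NaN values line up are treated as equal here.
mask = df.duplicated(keep=False)
print(mask.tolist())

# True counts as 1 when summed, so the sum is the number of flagged rows.
n_dupes = int(mask.sum())
print(n_dupes)
```

Note that this gives the total number of rows involved in any duplication, not a count per distinct row.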
Answer 1 (score: 0)
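I just solved it.
The problem, as I said, was that groupby does not accept NaN values.
So I replaced every NaN with 0 using fillna(0), and now the rows can be compared correctly.
Here is my new function, which works correctly: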
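The effect described above can be reproduced on a tiny frame. This is only a sketch with made-up columns (key, val), not the data from the question: by default groupby leaves out rows whose key is NaN, and after fillna(0) those rows are counted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two identical rows with a real key, two with a NaN key.
df = pd.DataFrame({
    "key": [1.0, 1.0, np.nan, np.nan],
    "val": [1, 1, 2, 2],
})

# Default behaviour: rows whose group key is NaN belong to no group.
default_sizes = df.groupby(["key", "val"]).size()

# After fillna(0) the former NaN keys are ordinary values and get counted.
filled_sizes = df.fillna(0).groupby(["key", "val"]).size()

print(default_sizes)
print(filled_sizes)
```

One caveat with this workaround: a row that genuinely contains 0 becomes indistinguishable from a row that contained NaN, so it is only safe when 0 does not otherwise occur in the affected columns.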
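from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    # Every row that has at least one exact duplicate
    aux = data[data.duplicated(keep=False)]
    # One representative (the first occurrence) of each duplicated row
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # Replace NaN with 0 so groupby keeps those rows, then count each group
    s = aux.fillna(0).groupby(data.columns.tolist()).size().reset_index().rename(columns={0:'count'})
    # The reversal assumes the groups come back in the opposite order to the rows of x; verify this for other data
    x['num_reps'] = s['count'].tolist()[::-1]
    return x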
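As a possible alternative, newer pandas versions (1.1 and later) let groupby keep NaN keys directly through the dropna=False argument, which avoids both the fillna(0) step and the reliance on group order. The sketch below is not the code from this answer; count_duplicates and the sample frame are hypothetical names and data used for illustration.

```python
import numpy as np
import pandas as pd


def count_duplicates(data: pd.DataFrame) -> pd.DataFrame:
    """Return one row per duplicated row pattern with a num_reps column."""
    cols = data.columns.tolist()
    counts = (
        data.groupby(cols, dropna=False)  # keep groups whose keys contain NaN
        .size()
        .reset_index(name="num_reps")
    )
    # Keep only the patterns that occur more than once.
    return counts[counts["num_reps"] > 1].reset_index(drop=True)


# Hypothetical sample with NaN in both a text and a numeric column.
sample = pd.DataFrame({
    "dur": [3.0, 1.0, 1.0, 3.0, 2.0],
    "cola": ["tcf", "none", "none", "tcf", np.nan],
    "hours": [np.nan, 38.0, 38.0, np.nan, 40.0],
})
result = count_duplicates(sample)
print(result)
```

Unlike the function above, this version returns a fresh 0-based index rather than the index label of each first occurrence, so it fits best when the original row labels are not needed.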