有没有更快的替代col.drop_duplicates()的方法?

时间:2019-01-15 10:25:58

标签: python-3.x pandas jupyter-notebook drop-duplicates

我正在尝试删除数据框(csv)中的重复数据,并获得一个单独的csv以显示每一列的唯一答案。问题是我的代码已经运行了一天(准确地说是22个小时),我愿意接受其他建议。

我的数据大约有20,000行,带有标题。我曾尝试像df [col] .unique()一样逐一检查唯一列表,并且花费的时间并不长。

>df = pd.read_csv('Surveydata.csv')
>
>df_uni=df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
>
>df_uni.to_csv('Surveydata_unique.csv',index=False)

我期望的是具有相同列集但在每个字段中没有任何重复的数据框。例如如果df ['Rmoisture']组合为Yes,No,Nan,则在另一个数据帧df_uni的同一列中应该只包含这3个。

编辑:这是示例 input output

2 个答案:

答案 0 :(得分:3)

如果列中的值顺序不重要,则将每列转换为set以删除重复项,然后转换为Series并通过concat连接在一起:

df1 = pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('l').items()}, axis=1)

如果订单很重要:

df1 = pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)

在2k行中执行1k个唯一值

np.random.seed(2019)

#2k rows
df = pd.DataFrame(np.random.randint(1000, size=(20, 2000))).astype(str)


In [151]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.07 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [152]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('l').items()}, axis=1)
323 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [153]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
430 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

在2k行中执行100个唯一值

df = pd.DataFrame(np.random.randint(100, size=(20, 2000))).astype(str)

In [155]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.3 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [156]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('l').items()}, axis=1)
544 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [157]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
654 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

答案 1 :(得分:2)

另一种方法:

new_df = []
[new_df.append(pd.DataFrame(df[i].unique(), columns=[i])) for i in df.columns]
new_df = pd.concat(new_df,axis=1)
print(new_df)


   Mass     Length  Material  Special Mark  Special Num  Breaking  \
0    4.0   5.500000     Wood            A         20.0      Yes   
1   12.0   2.600000    Steel          NaN          NaN       No   
2    1.0   3.500000   Rubber            B          5.5      NaN   
3   15.0   6.500000  Plastic            X          6.6      NaN   
4    6.0  12.000000      NaN          NaN          5.6      NaN   
5   14.0   2.500000      NaN          NaN          6.3      NaN   
6    2.0  15.000000      NaN          NaN          NaN      NaN   
7    8.0   2.000000      NaN          NaN          NaN      NaN   
8    7.0  10.000000      NaN          NaN          NaN      NaN   
9    9.0   2.200000      NaN          NaN          NaN      NaN   
10  11.0   4.333333      NaN          NaN          NaN      NaN   
11  13.0   4.666667      NaN          NaN          NaN      NaN   
12   NaN   3.750000      NaN          NaN          NaN      NaN   
13   NaN   1.666667      NaN          NaN          NaN      NaN   

                  Comment  
0        There is no heat  
1                     NaN  
2       Contains moisture  
3   Hit the table instead  
4          A sign of wind  
5                     NaN  
6                     NaN  
7                     NaN  
8                     NaN  
9                     NaN  
10                    NaN  
11                    NaN  
12                    NaN  
13                    NaN