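I am trying to remove duplicate values from a DataFrame (loaded from a CSV) and write a separate CSV that lists the unique values of each column. The problem is that my code has been running for a day (22 hours, to be exact), so I am open to other suggestions.
My data has about 20,000 rows, plus a header. I also tried checking the unique values one column at a time with df[col].unique(), and that did not take long.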
import pandas as pd

df = pd.read_csv('Surveydata.csv')

# For each column: drop duplicate values, then renumber the remaining rows from 0
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))

df_uni.to_csv('Surveydata_unique.csv', index=False)
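What I expect is a DataFrame with the same set of columns but without any duplicates within each column. For example, if df['Rmoisture'] contains a mix of Yes, No, and NaN, the same column in the new DataFrame df_uni should contain only those three values.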
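To make the goal concrete, here is a minimal sketch on a toy frame. The column name Rmoisture comes from the question; the Soil column and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the values are made up.
df = pd.DataFrame({
    "Rmoisture": ["Yes", "No", np.nan, "Yes", "No"],
    "Soil": ["Clay", "Clay", "Sand", "Clay", "Sand"],
})

# Same per-column logic as in the question: drop duplicates, then renumber.
df_uni = df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
print(df_uni)
```

Columns with fewer unique values than the longest one are padded with NaN at the bottom, so a trailing NaN in the result is not necessarily a value that occurred in the data.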
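Answer 0 (score: 3)
If the order of values within a column is not important, convert each column to a set to remove duplicates, then convert each set to a Series and join them with concat: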
df1 = pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
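As a rough illustration of the set-based approach, the sketch below runs it on a small invented frame (the column names a and b are placeholders, not from the question).

```python
import pandas as pd

# Small made-up frame; "a" has three unique values, "b" has two.
df = pd.DataFrame({
    "a": ["x", "y", "x", "z"],
    "b": ["p", "p", "p", "q"],
})

# to_dict("list") maps each column name to a plain Python list of its values.
columns = df.to_dict("list")

# set() removes duplicates but does not preserve the original order.
df1 = pd.concat({k: pd.Series(list(set(v))) for k, v in columns.items()}, axis=1)
print(df1)
```

One caveat, as far as I can tell: in numeric columns each missing value becomes a separate Python float NaN, and because NaN never compares equal to itself, a set may keep more than one of them. If missing values matter, the unique()-based variant is probably the safer choice.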
If the order matters:
df1 = pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
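Since the question ultimately wants a CSV, here is a hedged sketch of the order-preserving variant followed by the export step. The frame is invented, and the in-memory buffer stands in for a real file path such as the Surveydata_unique.csv used in the question.

```python
import io
import pandas as pd

# Small made-up frame for illustration.
df = pd.DataFrame({
    "a": ["x", "y", "x", "z"],
    "b": ["q", "p", "q", "p"],
})

# Series.unique() keeps values in order of first appearance.
df1 = pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)

# Write to an in-memory buffer here; pass a file path instead to create a real CSV.
buffer = io.StringIO()
df1.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The NaN padding of shorter columns is written as empty fields, so the CSV has as many data rows as the column with the most unique values.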
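Performance on a frame of 20 rows by 2,000 columns, with values drawn from 1,000 possible values:
import numpy as np
import pandas as pd
np.random.seed(2019)
# size=(20, 2000) gives 20 rows x 2000 columns
df = pd.DataFrame(np.random.randint(1000, size=(20, 2000))).astype(str)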
In [151]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.07 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [152]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
323 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [153]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
430 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance with the same frame shape, but values drawn from only 100 possible values:
df = pd.DataFrame(np.random.randint(100, size=(20, 2000))).astype(str)
In [155]: %timeit df.apply(lambda col: col.drop_duplicates().reset_index(drop=True))
1.3 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [156]: %timeit pd.concat({k: pd.Series(list(set(v))) for k, v in df.to_dict('list').items()}, axis=1)
544 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [157]: %timeit pd.concat({col: pd.Series(df[col].unique()) for col in df.columns}, axis=1)
654 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
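The timings above come from IPython's %timeit. For anyone who wants to rerun the comparison as a plain script, the following is a sketch using the standard timeit module. The function names via_apply, via_set and via_unique are my own labels, and the frame is deliberately smaller (20 rows by 200 columns) so it finishes quickly; absolute numbers will differ from those above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2019)
# Smaller than the frames timed above: 20 rows x 200 columns.
df = pd.DataFrame(np.random.randint(100, size=(20, 200))).astype(str)


def via_apply(frame):
    # The approach from the question.
    return frame.apply(lambda col: col.drop_duplicates().reset_index(drop=True))


def via_set(frame):
    # Order is not preserved.
    return pd.concat(
        {k: pd.Series(list(set(v))) for k, v in frame.to_dict("list").items()},
        axis=1,
    )


def via_unique(frame):
    # Order of first appearance is preserved.
    return pd.concat(
        {col: pd.Series(frame[col].unique()) for col in frame.columns},
        axis=1,
    )


results = {}
for func in (via_apply, via_set, via_unique):
    # Best of 3 repeats, 5 calls each; stored as seconds per call.
    best = min(timeit.repeat(lambda: func(df), number=5, repeat=3)) / 5
    results[func.__name__] = best
    print(f"{func.__name__}: {best * 1000:.1f} ms per call")
```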
Answer 1 (score: 2)
Another approach:
# Build one single-column DataFrame of unique values per column, then join them side by side
frames = [pd.DataFrame(df[i].unique(), columns=[i]) for i in df.columns]
new_df = pd.concat(frames, axis=1)
print(new_df)
    Mass     Length  Material  Special Mark  Special Num  Breaking  \
0    4.0   5.500000      Wood             A         20.0       Yes
1   12.0   2.600000     Steel           NaN          NaN        No
2    1.0   3.500000    Rubber             B          5.5       NaN
3   15.0   6.500000   Plastic             X          6.6       NaN
4    6.0  12.000000       NaN           NaN          5.6       NaN
5   14.0   2.500000       NaN           NaN          6.3       NaN
6    2.0  15.000000       NaN           NaN          NaN       NaN
7    8.0   2.000000       NaN           NaN          NaN       NaN
8    7.0  10.000000       NaN           NaN          NaN       NaN
9    9.0   2.200000       NaN           NaN          NaN       NaN
10  11.0   4.333333       NaN           NaN          NaN       NaN
11  13.0   4.666667       NaN           NaN          NaN       NaN
12   NaN   3.750000       NaN           NaN          NaN       NaN
13   NaN   1.666667       NaN           NaN          NaN       NaN

                  Comment
0        There is no heat
1                     NaN
2       Contains moisture
3   Hit the table instead
4          A sign of wind
5                     NaN
6                     NaN
7                     NaN
8                     NaN
9                     NaN
10                    NaN
11                    NaN
12                    NaN
13                    NaN