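Python DataFrame: drop duplicates based on columns

Time: 2019-09-19 17:19:53

Tags: python pandas pandas-groupby

I have the following dataset of card numbers and swipe times. The output must contain unique card and date combinations: if a card is swiped several times in one day, the output should list that card once for that day, using its first scan. Any pointers on how to get started with Python and Pandas are appreciated.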

Card No     Time 
3434    9/17/2018 5:19
3434    9/17/2018 5:57
3456    9/17/2018 5:58
3457    9/17/2018 5:59
3234    9/17/2018 6:00
3457    9/17/2018 6:07
3459    9/17/2018 6:20
3434    9/20/2018 9:35
3434    9/20/2018 9:35
3456    9/20/2018 9:41
3457    9/20/2018 9:41
3234    9/20/2018 9:43
3457    9/20/2018 9:46
3459    9/20/2018 9:46
3434    9/20/2018 9:51
3434    9/20/2018 9:52
3456    9/20/2018 9:52

Output :
Card No    Time
3434    9/17/2018
3456    9/17/2018
3457    9/17/2018
3234    9/17/2018
3459    9/17/2018
3434    9/20/2018
3456    9/20/2018
3457    9/20/2018
3234    9/20/2018
3459    9/20/2018
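
To try the answers below, the sample first has to be loaded into a DataFrame with a real datetime column, since they all rely on the `.dt` accessor. A minimal sketch, assuming the data is available as comma-separated text (the `raw` string and the comma separator are illustrative assumptions, not part of the question):

```python
import io

import pandas as pd

# Hypothetical CSV rendering of the sample data from the question.
raw = """Card No,Time
3434,9/17/2018 5:19
3434,9/17/2018 5:57
3456,9/17/2018 5:58
3457,9/17/2018 5:59
3234,9/17/2018 6:00
3457,9/17/2018 6:07
3459,9/17/2018 6:20
3434,9/20/2018 9:35
3434,9/20/2018 9:35
3456,9/20/2018 9:41
3457,9/20/2018 9:41
3234,9/20/2018 9:43
3457,9/20/2018 9:46
3459,9/20/2018 9:46
3434,9/20/2018 9:51
3434,9/20/2018 9:52
3456,9/20/2018 9:52
"""

df = pd.read_csv(io.StringIO(raw))

# Parse the month/day/year strings into datetime64 values so that
# df['Time'].dt.date is available.
df['Time'] = pd.to_datetime(df['Time'], format='%m/%d/%Y %H:%M')
```

Passing an explicit `format` avoids any ambiguity between month-first and day-first dates.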

3 Answers:

Answer 0: (score: 2)

Try groupby() on the card and the date, then use idxmin to extract the desired rows:

df.loc[df.groupby(['Card No', df['Time'].dt.date]).Time.idxmin()]

Output:

    Card No                Time
4      3234 2018-09-17 06:00:00
11     3234 2018-09-20 09:43:00
0      3434 2018-09-17 05:19:00
7      3434 2018-09-20 09:35:00
2      3456 2018-09-17 05:58:00
9      3456 2018-09-20 09:41:00
3      3457 2018-09-17 05:59:00
10     3457 2018-09-20 09:41:00
6      3459 2018-09-17 06:20:00
13     3459 2018-09-20 09:46:00

You can also use drop_duplicates, but you first need to create a date column. Note that drop_duplicates keeps the first row of each (card, date) pair, so this relies on the rows already being ordered by time:

df['date'] = df['Time'].dt.date
df.drop_duplicates(['Card No', 'date'])
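
If the rows are not guaranteed to be in time order, sorting before dropping duplicates makes the "first swipe" explicit. A small sketch on a deliberately shuffled subset of the sample (the subset and the name `first_swipes` are illustrative assumptions):

```python
import pandas as pd

# Shuffled subset of the sample: the 05:19 swipe appears after the 05:57 one.
df = pd.DataFrame({
    'Card No': [3434, 3456, 3434, 3434, 3456],
    'Time': pd.to_datetime([
        '2018-09-17 05:57', '2018-09-17 05:58', '2018-09-17 05:19',
        '2018-09-20 09:35', '2018-09-20 09:41',
    ]),
})

first_swipes = (
    df.sort_values('Time')                        # earliest swipe comes first
      .assign(date=lambda d: d['Time'].dt.date)   # helper column: calendar day
      .drop_duplicates(['Card No', 'date'])       # keep first row per card and day
      .reset_index(drop=True)
)
print(first_swipes)
```

Using `assign` keeps the original `df` unchanged, which is convenient when the helper column is only needed for deduplication.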

Answer 1: (score: 1)

Assuming your Time column is already sorted as in the sample, and you want the output without the time part, you can try the following:

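# Note: relies on pandas < 2.0, where nth(0) returns a frame indexed by the
# group keys; from pandas 2.0 on, nth() acts as a row filter instead.
(df.groupby(['Card No', df.Time.dt.date], sort=False).nth(0)
   .drop(columns='Time')   # drop the full timestamp, keep the date from the index
   .reset_index())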

Out[30]:
   Card No        Time
0    3434  2018-09-17
1    3456  2018-09-17
2    3457  2018-09-17
3    3234  2018-09-17
4    3459  2018-09-17
5    3434  2018-09-20
6    3456  2018-09-20
7    3457  2018-09-20
8    3234  2018-09-20
9    3459  2018-09-20

Otherwise, you can try groupby with head:

df.groupby(['Card No', df.Time.dt.date], sort=False).head(1)

Out[41]:
    Card No                Time
0     3434 2018-09-17 05:19:00
2     3456 2018-09-17 05:58:00
3     3457 2018-09-17 05:59:00
4     3234 2018-09-17 06:00:00
6     3459 2018-09-17 06:20:00
7     3434 2018-09-20 09:35:00
9     3456 2018-09-20 09:41:00
10    3457 2018-09-20 09:41:00
11    3234 2018-09-20 09:43:00
13    3459 2018-09-20 09:46:00
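
The two ideas in this answer can be combined: select the first row per card and day with `head(1)`, then strip the time part afterwards. This is only a sketch on a subset of the sample, and like the answer it assumes the rows are already sorted by time; the variable names `first` and `result` are illustrative.

```python
import pandas as pd

# Time-ordered subset of the sample data.
df = pd.DataFrame({
    'Card No': [3434, 3434, 3456, 3434, 3434, 3456],
    'Time': pd.to_datetime([
        '2018-09-17 05:19', '2018-09-17 05:57', '2018-09-17 05:58',
        '2018-09-20 09:35', '2018-09-20 09:35', '2018-09-20 09:41',
    ]),
})

# First row of every (card, calendar day) group, in original row order.
first = df.groupby(['Card No', df['Time'].dt.date], sort=False).head(1)

# Replace the timestamp with its date part and renumber the rows.
result = first.assign(Time=first['Time'].dt.date).reset_index(drop=True)
print(result)
```

Because `head(1)` returns the original rows rather than a frame indexed by the group keys, this variant does not depend on how `nth` shapes its result.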

Answer 2: (score: 0)

import pandas as pd

s = """3434    9/17/2018 5:19
3434    9/17/2018 5:57
3456    9/17/2018 5:58
3457    9/17/2018 5:59
3234    9/17/2018 6:00
3457    9/17/2018 6:07
3459    9/17/2018 6:20
3434    9/20/2018 9:35
3434    9/20/2018 9:35
3456    9/20/2018 9:41
3457    9/20/2018 9:41
3234    9/20/2018 9:43
3457    9/20/2018 9:46
3459    9/20/2018 9:46
3434    9/20/2018 9:51
3434    9/20/2018 9:52
3456    9/20/2018 9:52"""

# Split each line on the four-space separator into [card, time]
raw = [row.split("    ") for row in s.split("\n")]

df = pd.DataFrame(raw, columns=["card", "time"])
df["time"] = pd.to_datetime(df.time)
df["date"] = df["time"].dt.date

# Taking the minimum per (card, date) keeps the earliest swipe time of each day
df.groupby(["card", "date"]).min().reset_index(level=1)