Count and remove the duplicates of each unique row in a pandas DataFrame

Time: 2019-05-22 09:03:02

Tags: python pandas dataframe

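A DataFrame contains more than 150,000 rows, including duplicates. A sample of the data is shown below; it has 25 columns (including the index). I would like to:

1) Count how many times each unique row is repeated

2) Remove all duplicated rows, comparing whole rows

3) Insert a new column showing the number of repetitions of each unique row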

,Date,Time,Company,AV_ID,timestamp,Longitude,Latitude,Altitude,Roll,Pitch,Yaw,Roll Rate,Pitch Rate,Yaw Rate,Speed-x,Speed-y,Speed-z,Drive Mode,Throttle Actuator Value,Brake Light Condition,Brake Actuator Value,Steering Angle,Direction Indicator,Reverse Light Condition
0,29-Jan-2019,09:29:43.184,DEL,DEL0002,2019-01-29 09:33:33.425000,,,,,,,,0.0,,,2.22,,,9.25,,,,,
1,29-Jan-2019,09:29:43.184,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
2,29-Jan-2019,09:29:43.199,DEL,DEL0002,2019-01-29 09:33:33.425000,,,,,,,,0.0,,,2.22,,,9.25,,,,,
3,29-Jan-2019,09:29:43.199,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
4,29-Jan-2019,09:29:44.543,DEL,DEL0002,2019-01-29 09:33:35.425000,,,,,,,,0.0,,,2.5,,,7.63,,,,,
5,29-Jan-2019,09:29:44.543,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
6,29-Jan-2019,09:29:44.574,DEL,DEL0002,2019-01-29 09:33:35.425000,,,,,,,,0.0,,,2.5,,,7.63,,,,,
7,29-Jan-2019,09:29:44.574,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
8,29-Jan-2019,09:29:46.606,DEL,DEL0002,2019-01-29 09:33:37.425000,,,,,,,,0.0,,,2.22,,,5.48,,,,,
9,29-Jan-2019,09:29:46.606,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
10,29-Jan-2019,09:29:46.622,DEL,DEL0002,2019-01-29 09:33:37.425000,,,,,,,,0.0,,,2.22,,,5.48,,,,,
11,29-Jan-2019,09:29:46.622,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
12,29-Jan-2019,09:29:48.573,DEL,DEL0002,2019-01-29 09:33:39.422000,,,,,,,,0.0,,,1.94,,,6.02,,,,,
13,29-Jan-2019,09:29:48.573,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
14,29-Jan-2019,09:29:48.588,DEL,DEL0002,2019-01-29 09:33:39.422000,,,,,,,,0.0,,,1.94,,,6.02,,,,,

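So far I have been able to remove the duplicates with the steps below. However, I cannot count the number of repetitions for each unique row, nor insert that count into a new column.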

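import pandas as pd

# Convert the timestamp column; `local` is an offset defined elsewhere (not shown)
s = pd.to_numeric(mydataset['timestamp'], errors='coerce') + local
mydataset['timestamp'] = pd.to_datetime(s, unit='ms')

# Select the rows that duplicate an earlier row (drop_duplicates() would remove them)
duplicatedRows = mydataset[mydataset.duplicated()]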

2 Answers:

Answer 0: (score: 0)

You can try grouping by all columns with groupby and then counting the duplicates with size:

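# dropna=False (added in pandas 1.1) keeps rows whose key columns contain NaN; by default groupby drops them
df = df.groupby(df.columns.tolist(), dropna=False).size().reset_index(name='Size')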
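
Because most columns in the sample are empty, it may help to see how groupby treats NaN keys. The following is a minimal sketch on a made-up frame (the column names are borrowed from the question, the values are illustrative only), assuming pandas 1.1 or later for the `dropna` argument:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names come from the question,
# the values are invented for illustration.
df = pd.DataFrame({
    "Company": ["DEL", "DEL", "DEL", "in"],
    "AV_ID": ["DEL0002", "DEL0002", "DEL0002", "msg:"],
    "Longitude": [np.nan, np.nan, np.nan, np.nan],
    "Speed-x": [2.22, 2.22, 2.5, np.nan],
})
all_cols = df.columns.tolist()

# Default behaviour: any row whose key contains NaN is left out of the groups.
default_sizes = df.groupby(all_cols).size()

# dropna=False treats NaN as an ordinary key value, so every row is counted.
counts = (
    df.groupby(all_cols, dropna=False)
      .size()
      .reset_index(name="Size")
)
print(counts)
```

With the default settings every row of this toy frame is discarded, since `Longitude` is always NaN; with `dropna=False` each unique row appears once together with its count.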

Answer 1: (score: 0)

Assuming I have understood your needs correctly, look at the following subset of your data:

4,29-Jan-2019,09:29:44.543,DEL,DEL0002,2019-01-29 09:33:35.425000,,,,,,,,0.0,,,2.5,,,7.63,,,,,
5,29-Jan-2019,09:29:44.543,in,msg:,should,be,20,or,18!,,,,,,,,,,,,,,,
6,29-Jan-2019,09:29:44.574,DEL,DEL0002,2019-01-29 09:33:35.425000,,,,,,,,0.0,,,2.5,,,7.63,,,,,

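If you want the first and last of these rows to be treated as duplicates, you need to specify which columns to group by: the Time values in the second column differ (09:29:44.543 and 09:29:44.574), so grouping on every column would not put these rows in the same group.

Using several of your columns as an example: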
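
The point about the differing Time values can be checked with `DataFrame.duplicated`. This is a small sketch on a hypothetical three-row frame that mirrors rows 4 to 6 of the sample, with only a few columns kept for brevity:

```python
import pandas as pd

# Hypothetical frame mirroring rows 4-6 of the sample data.
df = pd.DataFrame({
    "Time": ["09:29:44.543", "09:29:44.543", "09:29:44.574"],
    "Company": ["DEL", "in", "DEL"],
    "AV_ID": ["DEL0002", "msg:", "DEL0002"],
    "timestamp": [
        "2019-01-29 09:33:35.425000",
        "should",
        "2019-01-29 09:33:35.425000",
    ],
})

# Comparing whole rows: no row repeats, because Time differs.
full_row_dupes = df.duplicated()

# Comparing only the chosen columns: the third row repeats the first.
subset_dupes = df.duplicated(subset=["Company", "AV_ID", "timestamp"])

print(full_row_dupes.tolist())
print(subset_dupes.tolist())
```

Which columns belong in `subset` depends on what should count as "the same record"; the three used here are only one plausible choice.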

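cols_to_groupby = ['Company', 'AV_ID', 'timestamp', 'Longitude', 'Latitude', 'Altitude']

# insert a new column with the number of rows sharing each key;
# selecting one column makes transform return a Series, and dropna=False
# (added in pandas 1.1) keeps keys that contain NaN, as in the sample
df['duplicate_count'] = df.groupby(cols_to_groupby, dropna=False)['Company'].transform('size')

# get rid of duplicates, keeping the first row of each group:
df = df.drop_duplicates(subset=cols_to_groupby)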
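
An alternative that avoids transform is to compute the group sizes once and merge them back onto the de-duplicated rows. The sketch below covers all three steps from the question on invented data; the shortened `cols_to_groupby` list and the values are assumptions, and `dropna=False` requires pandas 1.1 or later:

```python
import numpy as np
import pandas as pd

# Hypothetical data: names follow the question, values are illustrative.
df = pd.DataFrame({
    "Time": ["09:29:43.184", "09:29:43.199", "09:29:44.543",
             "09:29:44.574", "09:29:46.606"],
    "Company": ["DEL"] * 5,
    "AV_ID": ["DEL0002"] * 5,
    "timestamp": ["09:33:33.425", "09:33:33.425", "09:33:35.425",
                  "09:33:35.425", "09:33:37.425"],
    "Longitude": [np.nan] * 5,
})
cols_to_groupby = ["Company", "AV_ID", "timestamp", "Longitude"]

# 1) Count the rows per unique key; dropna=False keeps keys containing NaN.
counts = (
    df.groupby(cols_to_groupby, dropna=False)
      .size()
      .reset_index(name="duplicate_count")
)

# 2) Keep the first row of each key, then
# 3) attach the count as a new column (merge matches NaN keys to each other).
result = df.drop_duplicates(subset=cols_to_groupby).merge(
    counts, on=cols_to_groupby, how="left"
)
print(result)
```

Here `duplicate_count` is the total number of occurrences of each key, including the row that is kept; subtract 1 if only the extra copies should be counted. Note that merge returns a fresh integer index rather than the original row labels.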