Combining two rows in a pandas DataFrame that share the same values in several columns, and comparing the data in another column

Time: 2017-07-09 16:43:05

Tags: pandas

I am very new to Python pandas. I have a sorted pandas DataFrame with 10k+ rows. Below is a sample of it:

Example:

             0         1                  2       3        4             5

Hour:12 Min:31 Sec:24 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:33 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:30 Sec:31 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:33 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337         
Hour:12 Min:31 Sec:19 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337      
Hour:12 Min:32 Sec:22 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337     
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337     
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337          
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337          
Hour:12 Min:32 Sec:17 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337           
Hour:12 Min:30 Sec:24 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337            
Hour:12 Min:32 Sec:46 Ms    E_ID:459   Name:I_SECONDROW UE_C:9   M_ID:93  C_ID_1:20337          
Hour:12 Min:30 Sec:24 Ms    E_ID:500   Name:I_SECONDROW UE_C:1   M_ID:80  C_ID_1:20110         
Hour:12 Min:30 Sec:26 Ms    E_ID:500   Name:M_FIRSTROWW UE_C:1   M_ID:80  C_ID_1:20110      

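Now I want to combine 2 rows (a pair) whose Name values are M_FIRSTROWW and I_SECONDROW and which have the same data in columns 1, 3, 4 and 5.

A selected pair must have a time difference of at most 5 seconds.

Expected output: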

Hour:12 Min:30 Sec:24 Ms    E_ID:500   Name:I_SECONDROW UE_C:1   M_ID:80  C_ID_1:20110         
Hour:12 Min:30 Sec:26 Ms    E_ID:500   Name:M_FIRSTROWW UE_C:1   M_ID:80  C_ID_1:20110


Hour:12 Min:30 Sec:31 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337 
Hour:12 Min:30 Sec:26 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337 


Hour:12 Min:32 Sec:22 Ms    E_ID:459   Name:M_FIRSTROWW UE_C:10  M_ID:93  C_ID_1:20337
Hour:12 Min:32 Sec:17 Ms    E_ID:459   Name:I_SECONDROW UE_C:10  M_ID:93  C_ID_1:20337  

1 answer:

Answer 0 (score: 0)

1) Some useful imports

import pandas as pd
import numpy as np
import datetime as dt
import itertools
import re

2) Import and clean the data

df = pd.read_csv("data.csv", sep="|", header=None, names=["time", "mseid", "name", "uec", "mid", "cid"])
df["time"] = [dt.datetime.strptime(":".join(re.findall(r'\d+', time_string)), "%H:%M:%S") for time_string in df["time"]]
df["mseid"] = [mseid.split(":")[-1] for mseid in df["mseid"]]
df["name"] = [name.split(":")[-1] for name in df["name"]]
df["uec"] = [uec.split(":")[-1] for uec in df["uec"]]
df["mid"] = [mid.split(":")[-1] for mid in df["mid"]]
df["cid"] = [cid.split(":")[-1] for cid in df["cid"]]
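
As a small illustration of the cleaning step, the sketch below applies the same regex and split ideas to single sample strings. It assumes the raw fields look like the `Hour:12 Min:31 Sec:24 Ms` and `E_ID:459` values in the question; the helper names `parse_time` and `strip_label` are hypothetical and not part of the answer.

```python
import datetime as dt
import re

def parse_time(time_string):
    """Pull the digit runs out of 'Hour:12 Min:31 Sec:24 Ms' and parse them as H:M:S."""
    digits = re.findall(r"\d+", time_string)
    # Assumption: only the first three numbers (hour, minute, second) are used,
    # so a populated 'Ms' field would be ignored rather than break strptime.
    return dt.datetime.strptime(":".join(digits[:3]), "%H:%M:%S")

def strip_label(field):
    """Keep only the value after the last colon, e.g. 'E_ID:459' -> '459'."""
    return field.strip().split(":")[-1]

parsed = parse_time("Hour:12 Min:31 Sec:24 Ms")
print(parsed)                   # 1900-01-01 12:31:24
print(strip_label("E_ID:459"))  # 459
```

Because `strptime` is given no date, the result falls on 1900-01-01, which is why that date shows up in the outputs of the later steps.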

3) Sort by row name and time, group by row name, and extract the index labels of each group. We can then zip these indices to pair each FIRSTROW with a SECONDROW.

df_sorted = df.sort_values(["name", "time"]).groupby("name").groups.values()
>>> dict_values([Int64Index([10, 12, 6, 7, 8, 4, 0, 9, 11], dtype='int64'), Int64Index([13, 2, 5, 1, 3], dtype='int64')])

# https://stackoverflow.com/questions/12355442/converting-a-list-of-tuples-into-a-simple-flat-list
ordered = list(itertools.chain(*zip(*df_sorted)))
num_groups = int(len(ordered) / 2)
ordered += [ind for ind in df.index if ind not in ordered]
ordered
>>> [10, 13, 12, 2, 6, 5, 7, 1, 8, 3, 0, 4, 9, 11]


df = df.iloc[ordered]
df = df.reset_index(drop=True)  # renumber rows 0..n-1 and discard the old index
df.head()

>>>     time    mseid   name    uec mid cid
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110
2   1900-01-01 12:30:24 500 I_SECONDROW 1   80  20110
3   1900-01-01 12:30:31 459 M_FIRSTROWW 10  93  20337
4   1900-01-01 12:30:26 459 I_SECONDROW 10  93  20337
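
To make the interleaving in step 3 easier to follow, here is a minimal sketch on two short, made-up index lists (the values in `second_rows` and `first_rows` are hypothetical, not taken from the data above). It shows that `zip` stops at the shorter list and that the unmatched indices end up at the tail.

```python
import itertools

second_rows = [10, 12, 6, 7]   # hypothetical index labels of I_SECONDROW rows
first_rows = [13, 2]           # hypothetical index labels of M_FIRSTROWW rows

# zip pairs the i-th element of each list and stops at the shorter list;
# chain then flattens those pairs into one interleaved list.
paired = list(itertools.chain(*zip(second_rows, first_rows)))
num_pairs = len(paired) // 2

# Indices that did not get a partner are appended at the end.
leftover = [i for i in second_rows + first_rows if i not in paired]
ordered = paired + leftover

print(paired)    # [10, 13, 12, 2]
print(ordered)   # [10, 13, 12, 2, 6, 7]
```

Note that this pairing is purely positional: the i-th row of one group is matched with the i-th row of the other, without comparing E_ID or any other column.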

4) Create and add a column that labels the row pairs

groups = [val for val in range(num_groups) for _ in [0, 1]]
remainder = len(df.index) - len(groups)
groups = groups + ["-" for i in range(remainder)]
df["pair"] = groups
groups

>>> [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, '-', '-', '-', '-']
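
A possible variation on this labelling, sketched below, keeps the column numeric by marking unpaired rows with NaN instead of "-". This is an assumption about what is convenient downstream, not what the answer does; `num_pairs` and `num_rows` are hypothetical sizes.

```python
import numpy as np
import pandas as pd

num_pairs, num_rows = 2, 6  # hypothetical: two complete pairs, six rows in total

# Start with NaN everywhere, then label the first 2 * num_pairs rows 0, 0, 1, 1, ...
pair = pd.Series(np.nan, index=range(num_rows))
pair.iloc[: 2 * num_pairs] = np.repeat(np.arange(num_pairs), 2)

print(pair.tolist())  # [0.0, 0.0, 1.0, 1.0, nan, nan]
```

With NaN labels, `groupby` leaves the unpaired rows out by default, so they cannot be mistaken for a pair if exactly two rows happen to be left over.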

5) Group the rows by pair, compute the time difference, and add a time_delta column

pairs = df.groupby("pair")["time"]
time_delta = []
for pair in pairs:
    if len(pair[1]) == 2:
        second, first = pair[1].values
        time_difference = abs(int((first - second) / np.timedelta64(1, "s")))  # timedelta64 to whole seconds
        time_delta.append(time_difference)
time_delta = [val for val in time_delta for _ in [0, 1]]
remainder = len(df.index) - len(time_delta)
time_delta = time_delta + [np.nan for i in range(remainder)]  # np.NaN was removed in NumPy 2.0
df["time_delta"] = time_delta
df

>>>     time    mseid   name    uec mid cid pair    time_delta
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337   0   2.0
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110   0   2.0
2   1900-01-01 12:30:24 500 I_SECONDROW 1   80  20110   1   7.0
3   1900-01-01 12:30:31 459 M_FIRSTROWW 10  93  20337   1   7.0
4   1900-01-01 12:30:26 459 I_SECONDROW 10  93  20337   2   116.0
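
The per-pair difference can also be computed without an explicit loop. The sketch below is one way to do it, assuming a `pair` column in which unpaired rows are NaN; the five rows are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "1900-01-01 12:30:24", "1900-01-01 12:30:26",
        "1900-01-01 12:30:24", "1900-01-01 12:30:31",
        "1900-01-01 12:32:46",
    ]),
    "pair": [0, 0, 1, 1, np.nan],  # the last row has no partner
})

# Earliest and latest time per pair; rows with a NaN pair label are skipped.
per_pair = df.groupby("pair")["time"].agg(["min", "max"])
delta = (per_pair["max"] - per_pair["min"]).dt.total_seconds()

# Map each pair's span back onto its rows; unpaired rows stay NaN.
df["time_delta"] = df["pair"].map(delta)
print(df["time_delta"].tolist())  # [2.0, 2.0, 7.0, 7.0, nan]
```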

6) Finally, create a boolean mask that keeps the rows with time_delta <= 5, then groupby('pair')

df[df.time_delta <= 5].head().groupby("pair").head()

    time    mseid   name    uec mid cid pair    time_delta
0   1900-01-01 12:30:24 459 I_SECONDROW 10  93  20337   0   2.0
1   1900-01-01 12:30:26 500 M_FIRSTROWW 1   80  20110   0   2.0

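Note that the timestamps default to the year 1900. This does not matter here, because we only subtract times that fall on the same day. However, accurate timestamps should be used when the data is created.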
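
The pairing in step 3 is positional and does not compare E_ID, UE_C, M_ID or C_ID_1, whereas the question asks for pairs that agree on those columns. One possible variation, sketched here under the assumption that each M_FIRSTROWW should be matched to its nearest I_SECONDROW with identical keys, uses `pd.merge_asof` with a 5-second tolerance. The six rows are made up for illustration, and the column names follow the ones used in step 2.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "1900-01-01 12:30:24", "1900-01-01 12:30:26",   # E_ID 500, 2 s apart
        "1900-01-01 12:30:26", "1900-01-01 12:30:29",   # E_ID 459, 3 s apart
        "1900-01-01 12:31:24", "1900-01-01 12:32:33",   # E_ID 459, 69 s apart
    ]),
    "mseid": ["500", "500", "459", "459", "459", "459"],
    "name": ["I_SECONDROW", "M_FIRSTROWW"] * 3,
    "uec": ["1", "1", "10", "10", "10", "10"],
    "mid": ["80", "80", "93", "93", "93", "93"],
    "cid": ["20110", "20110", "20337", "20337", "20337", "20337"],
})

keys = ["mseid", "uec", "mid", "cid"]

# merge_asof requires both sides to be sorted by the 'on' column.
first = df[df["name"] == "M_FIRSTROWW"].sort_values("time")
second = (df[df["name"] == "I_SECONDROW"]
          .sort_values("time")
          .assign(second_time=lambda d: d["time"]))  # keep the matched time visible

pairs = pd.merge_asof(
    first,
    second[keys + ["time", "second_time"]],
    on="time",
    by=keys,                           # only match rows with identical key columns
    tolerance=pd.Timedelta(seconds=5),
    direction="nearest",               # the partner may come before or after
)
pairs["time_delta"] = (pairs["time"] - pairs["second_time"]).abs().dt.total_seconds()
pairs = pairs.dropna(subset=["second_time"])  # drop rows with no partner in range

print(pairs[["mseid", "time", "second_time", "time_delta"]])
```

On this sample, the two M_FIRSTROWW rows at 12:30:26 and 12:30:29 find a partner and the one at 12:32:33 does not. Be aware that `merge_asof` can match the same I_SECONDROW to more than one M_FIRSTROWW, so a strictly one-to-one pairing would need an extra de-duplication step.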