I'm very new to Python pandas. I have a sorted pandas DataFrame with 10k+ rows. Here is a sample of it:
Example:
0 1 2 3 4 5
Hour:12 Min:31 Sec:24 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:32 Sec:33 Ms E_ID:459 Name:M_FIRSTROWW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:30 Sec:31 Ms E_ID:459 Name:M_FIRSTROWW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:32 Sec:33 Ms E_ID:459 Name:M_FIRSTROWW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:31 Sec:19 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:32 Sec:22 Ms E_ID:459 Name:M_FIRSTROWW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:30 Sec:26 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:30 Sec:26 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:30 Sec:26 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:32 Sec:17 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:30 Sec:24 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:32 Sec:46 Ms E_ID:459 Name:I_SECONDROW UE_C:9 M_ID:93 C_ID_1:20337
Hour:12 Min:30 Sec:24 Ms E_ID:500 Name:I_SECONDROW UE_C:1 M_ID:80 C_ID_1:20110
Hour:12 Min:30 Sec:26 Ms E_ID:500 Name:M_FIRSTROWW UE_C:1 M_ID:80 C_ID_1:20110
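Now I want to combine rows into pairs where one row has Name M_FIRSTROWW and the other has Name I_SECONDROW, and both rows have the same data in columns 1, 3, 4, and 5.
A selected pair should have a time difference of 5 seconds or less.
Expected output: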
Hour:12 Min:30 Sec:24 Ms E_ID:500 Name:I_SECONDROW UE_C:1 M_ID:80 C_ID_1:20110
Hour:12 Min:30 Sec:26 Ms E_ID:500 Name:M_FIRSTROWW UE_C:1 M_ID:80 C_ID_1:20110
Hour:12 Min:30 Sec:31 Ms E_ID:459 Name:M_FIRSTROWW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:30 Sec:26 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:32 Sec:22 Ms E_ID:459 Name:M_FIRSTROWW UE_C:10 M_ID:93 C_ID_1:20337
Hour:12 Min:32 Sec:17 Ms E_ID:459 Name:I_SECONDROW UE_C:10 M_ID:93 C_ID_1:20337
Answer 0 (score: 0)
1) Some useful imports
import pandas as pd
import numpy as np
import datetime as dt
import itertools
import re
2) Import and clean the data
df = pd.read_csv("data.csv", sep="|", header=None, names=["time", "mseid", "name", "uec", "mid", "cid"])
df["time"] = [dt.datetime.strptime(":".join(re.findall(r'\d+', time_string)), "%H:%M:%S") for time_string in df["time"]]
df["mseid"] = [mseid.split(":")[-1] for mseid in df["mseid"]]
df["name"] = [name.split(":")[-1] for name in df["name"]]
df["uec"] = [uec.split(":")[-1] for uec in df["uec"]]
df["mid"] = [mid.split(":")[-1] for mid in df["mid"]]
df["cid"] = [cid.split(":")[-1] for cid in df["cid"]]
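To see what the cleaning step does to a single value, here is a minimal sketch. It assumes the first pipe-delimited field looks like `Hour:12 Min:31 Sec:24 Ms`, as in the sample rows; the real layout of data.csv is not shown, so treat that as an assumption. If the `Ms` part carried a number too, `re.findall` would return four values and the `%H:%M:%S` format would no longer match.

```python
import datetime as dt
import re

raw_time = "Hour:12 Min:31 Sec:24 Ms"  # assumed layout of the first field

# Pull out every run of digits, then rebuild an "HH:MM:SS" string.
digits = re.findall(r"\d+", raw_time)
parsed = dt.datetime.strptime(":".join(digits), "%H:%M:%S")

# The other fields keep only the part after the last colon.
field = "E_ID:459".split(":")[-1]

print(digits)
print(parsed)
print(field)
```

Because no date is supplied, `strptime` fills in 1900-01-01, which is where the 1900 dates in the later output come from.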
3) Sort by row name and time, group by row name, and extract the index of each group. We can then zip these indices to pair up the FIRSTROWs with the SECONDROWs.
df_sorted = df.sort_values(["name", "time"]).groupby("name").groups.values()
>>> dict_values([Int64Index([10, 12, 6, 7, 8, 4, 0, 9, 11], dtype='int64'), Int64Index([13, 2, 5, 1, 3], dtype='int64')])
# https://stackoverflow.com/questions/12355442/converting-a-list-of-tuples-into-a-simple-flat-list
ordered = list(itertools.chain(*zip(*df_sorted)))
num_groups = int(len(ordered) / 2)
ordered += [ind for ind in df.index if ind not in ordered]
ordered
>>> [10, 13, 12, 2, 6, 5, 7, 1, 8, 3, 0, 4, 9, 11]
df = df.iloc[ordered]
df = df.reset_index()
del df['index']
df.head()
>>> time mseid name uec mid cid
0 1900-01-01 12:30:24 459 I_SECONDROW 10 93 20337
1 1900-01-01 12:30:26 500 M_FIRSTROWW 1 80 20110
2 1900-01-01 12:30:24 500 I_SECONDROW 1 80 20110
3 1900-01-01 12:30:31 459 M_FIRSTROWW 10 93 20337
4 1900-01-01 12:30:26 459 I_SECONDROW 10 93 20337
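The interleaving idiom in step 3 is easier to follow on plain lists. This sketch reuses the two index lists printed above; the variable names `second_rows`, `first_rows`, `interleaved`, and `leftovers` are only illustrative.

```python
import itertools

# Index lists copied from the groupby output above.
second_rows = [10, 12, 6, 7, 8, 4, 0, 9, 11]  # I_SECONDROW, sorted by time
first_rows = [13, 2, 5, 1, 3]                 # M_FIRSTROWW, sorted by time

# zip stops at the shorter list, so only five pairs are formed;
# chain flattens the (second, first) tuples into one flat list.
interleaved = list(itertools.chain(*zip(second_rows, first_rows)))

# Rows that did not get a partner are appended at the end.
leftovers = [i for i in sorted(second_rows + first_rows) if i not in interleaved]
ordered = interleaved + leftovers
num_groups = len(interleaved) // 2

print(ordered)
print(num_groups)
```

Note that this pairs rows purely by their position in each sorted list. It does not check that the two rows of a pair share the same values in the other columns.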
4) Create and add a column for the row pairing
groups = [val for val in range(num_groups) for _ in [0, 1]]
remainder = len(df.index) - len(groups)
groups = groups + ["-" for i in range(remainder)]
df["pair"] = groups
groups
>>> [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, '-', '-', '-', '-']
5) Group the row pairs, find the time difference, and add a time_delta column
pairs = df.groupby("pair")["time"]
time_delta = []
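for pair in pairs:
    # each item is a (label, Series) tuple; only real pairs have exactly two rows
    if len(pair[1]) == 2:
        second, first = pair[1].values
        time_difference = abs(int((first - second) / np.timedelta64(1, "s")))  # timedelta64 to whole seconds
        time_delta.append(time_difference)
# repeat each difference so that both rows of a pair carry it
time_delta = [val for val in time_delta for _ in [0, 1]]
remainder = len(df.index) - len(time_delta)
time_delta = time_delta + [np.nan for i in range(remainder)]  # np.NaN was removed in NumPy 2.0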
df["time_delta"] = time_delta
df
>>> time mseid name uec mid cid pair time_delta
0 1900-01-01 12:30:24 459 I_SECONDROW 10 93 20337 0 2.0
1 1900-01-01 12:30:26 500 M_FIRSTROWW 1 80 20110 0 2.0
2 1900-01-01 12:30:24 500 I_SECONDROW 1 80 20110 1 7.0
3 1900-01-01 12:30:31 459 M_FIRSTROWW 10 93 20337 1 7.0
4 1900-01-01 12:30:26 459 I_SECONDROW 10 93 20337 2 116.0
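The conversion from a datetime64 difference to seconds can be checked in isolation. This is a minimal sketch using the two timestamps of pair 1 from the table above; dividing by `np.timedelta64(1, "s")` is just one way to make the unit explicit.

```python
import numpy as np

# The two timestamps of pair 1, as NumPy datetime64 values
# (the same kind of values that Series.values returns).
second, first = np.array(
    ["1900-01-01T12:30:24", "1900-01-01T12:30:31"], dtype="datetime64[ns]"
)

# Subtracting gives a timedelta64; dividing by one second gives a float.
seconds = abs((first - second) / np.timedelta64(1, "s"))

print(seconds)
```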
6) Finally, create a boolean mask to keep every time_delta <= 5, then groupby('pair')
df[df.time_delta <=5].head().groupby("pair").head()
time mseid name uec mid cid pair time_delta
0 1900-01-01 12:30:24 459 I_SECONDROW 10 93 20337 0 2.0
1 1900-01-01 12:30:26 500 M_FIRSTROWW 1 80 20110 0 2.0
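The pairing above is positional, so pair 0 ends up mixing E_ID 459 with E_ID 500, while the question asks for rows that agree in columns 1, 3, 4, and 5. One possible way to add that check, offered as a sketch and not as part of the steps above, is to merge the M_FIRSTROWW rows with the I_SECONDROW rows on those columns and then filter on the time difference. The four-row frame below is a hand-built subset of the cleaned data, and names such as `keys`, `pairs`, and `close` are illustrative.

```python
import pandas as pd

# Hand-built subset of the cleaned data (four rows, two per E_ID).
df = pd.DataFrame({
    "time": pd.to_datetime([
        "1900-01-01 12:30:24", "1900-01-01 12:30:26",
        "1900-01-01 12:30:24", "1900-01-01 12:30:31",
    ]),
    "mseid": ["459", "500", "500", "459"],
    "name": ["I_SECONDROW", "M_FIRSTROWW", "I_SECONDROW", "M_FIRSTROWW"],
    "uec": ["10", "1", "1", "10"],
    "mid": ["93", "80", "80", "93"],
    "cid": ["20337", "20110", "20110", "20337"],
})

keys = ["mseid", "uec", "mid", "cid"]  # columns 1, 3, 4, 5 of the raw data
first = df[df["name"] == "M_FIRSTROWW"]
second = df[df["name"] == "I_SECONDROW"]

# Every FIRSTROW is matched with every SECONDROW that shares the key columns.
pairs = first.merge(second, on=keys, suffixes=("_first", "_second"))
pairs["time_delta"] = (pairs["time_first"] - pairs["time_second"]).abs().dt.total_seconds()

# Keep only the pairs that are at most 5 seconds apart.
close = pairs[pairs["time_delta"] <= 5]
print(close[["mseid", "time_first", "time_second", "time_delta"]])
```

On the full data a FIRSTROW can match several SECONDROWs with the same keys, so the merge may return more than one candidate per row. Choosing among them (for example, the nearest in time) is left out of this sketch.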
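Note that the timestamps default to the year 1900. This does not matter here, because we are subtracting times that fall on the same day. When creating the data, however, accurate timestamps should be used.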
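A case where the missing date does matter is a pair that straddles midnight. The sketch below uses made-up times and made-up dates (2024-01-01 and 2024-01-02) purely for illustration.

```python
import datetime as dt

# Two times three seconds apart, on either side of midnight.
t1 = dt.datetime.strptime("23:59:58", "%H:%M:%S")
t2 = dt.datetime.strptime("00:00:01", "%H:%M:%S")

# Both default to 1900-01-01, so the gap looks like almost a full day.
naive_gap = abs((t2 - t1).total_seconds())

# With the real (here: hypothetical) dates attached, the gap is correct.
d1 = dt.datetime.combine(dt.date(2024, 1, 1), t1.time())
d2 = dt.datetime.combine(dt.date(2024, 1, 2), t2.time())
real_gap = abs((d2 - d1).total_seconds())

print(naive_gap)
print(real_gap)
```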