I have two dataframes, and I want to condense the information in the first dataframe, which looks like this:
event_timestamp message_number an_robot
2015-04-15 12:09:39 10125 robot_7
2015-04-15 12:09:41 10053 robot_4
2015-04-15 12:09:44 10156 robot_7
2015-04-15 12:09:47 20205 robot_108
2015-04-15 12:09:51 10010 robot_38
2015-04-15 12:09:54 10012 robot_65
2015-04-15 12:09:59 10011 robot_39
The other dataframe looks like this:
sequence support
10053,10156,20205 0.94783
10010,10012 0.93322
I want to replace all sequences in dataframe 1 that also appear in dataframe 2, so the new dataframe should be:
event_timestamp message_number an_robot
2015-04-15 12:09:39 10125 robot_7
2015-04-15 12:09:41 10053,10156,20205 robot_4,robot_7,robot_108
2015-04-15 12:09:51 10010,10012 robot_38,robot_65
2015-04-15 12:09:59 10011 robot_39
Does anyone know how to achieve this? I know how to check whether these values match within a single row, but not how to compare multiple rows that have to follow directly after one another.
--- EDIT ---
Perhaps to make it a bit simpler, a new message_number could be generated for each sequence. The new dataframe could then be:
event_timestamp message_number an_robot
2015-04-15 12:09:39 10125 robot_7
2015-04-15 12:09:41 1 robot_4,robot_7,robot_108
2015-04-15 12:09:51 2 robot_38,robot_65
2015-04-15 12:09:59 10011 robot_39
Each sequence found in the sequence dataframe would be written as 0, 1, 2, 3, or 4 (up to the last sequence). I could always use these new numbers to update a database with the meaning of the message_number codes. Ideally the information about which robots performed the messages would be kept, but if that is too complicated, this is fine too.
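For the renumbering idea, a minimal sketch (names hypothetical, not from any answer below) of the id-to-sequence lookup such a database could store, built straight from the sequence dataframe:

```python
import pandas as pd

# the sequence dataframe from the question
df2 = pd.DataFrame({'sequence': ['10053,10156,20205', '10010,10012'],
                    'support': [0.94783, 0.93322]})

# hypothetical lookup: new message_number id -> original sequence,
# which could later be written to a database of code meanings
lookup = {i: seq for i, seq in enumerate(df2['sequence'])}
```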
Answer 0 (score: 1)
I am using unnesting on your df2, then mapping the rule back to df1 to build a group key, and using groupby together with agg:

df2.sequence = df2.sequence.str.split(',')
s = unnesting(df2, ['sequence'])
groupkey = df1.message_number.map(dict(zip(s.sequence.astype(int), s.index))).fillna(df1.message_number)
df1.groupby(groupkey).agg({'event_timestamp': 'first',
                           'message_number': lambda x: ','.join(x.astype(str)),
                           'an_robot': ','.join})
                    event_timestamp  ...                   an_robot
message_number                       ...
0.0             2015-04-15 12:09:41  ...  robot_4,robot_7,robot_108
1.0             2015-04-15 12:09:51  ...          robot_38,robot_65
10011.0         2015-04-15 12:09:59  ...                   robot_39
10125.0         2015-04-15 12:09:39  ...                    robot_7
[4 rows x 3 columns]
def unnesting(df, explode):
    # repeat each index label once per list element in the exploded column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
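Put together end-to-end, a self-contained sketch of this answer's approach, with the question's sample data inlined (sort=False added here only so groups keep their order of first appearance):

```python
import pandas as pd
import numpy as np

def unnesting(df, explode):
    # repeat each index label once per list element, then explode the columns
    idx = df.index.repeat(df[explode[0]].str.len())
    out = pd.concat(
        [pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    out.index = idx
    return out.join(df.drop(explode, axis=1), how='left')

df1 = pd.DataFrame({
    'event_timestamp': ['2015-04-15 12:09:39', '2015-04-15 12:09:41',
                        '2015-04-15 12:09:44', '2015-04-15 12:09:47',
                        '2015-04-15 12:09:51', '2015-04-15 12:09:54',
                        '2015-04-15 12:09:59'],
    'message_number': [10125, 10053, 10156, 20205, 10010, 10012, 10011],
    'an_robot': ['robot_7', 'robot_4', 'robot_7', 'robot_108',
                 'robot_38', 'robot_65', 'robot_39']})
df2 = pd.DataFrame({'sequence': ['10053,10156,20205', '10010,10012'],
                    'support': [0.94783, 0.93322]})

df2['sequence'] = df2.sequence.str.split(',')
s = unnesting(df2, ['sequence'])
# rows whose message_number belongs to a known sequence share that sequence's index;
# all other rows keep their own message_number as the group key
groupkey = (df1.message_number
            .map(dict(zip(s.sequence.astype(int), s.index)))
            .fillna(df1.message_number))
result = (df1.groupby(groupkey, sort=False)
          .agg({'event_timestamp': 'first',
                'message_number': lambda x: ','.join(x.astype(str)),
                'an_robot': ','.join})
          .reset_index(drop=True))
print(result)
```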
Answer 1 (score: 1)
It may be a little long if you wanted something simpler, but the workflow reads cleanly, like a data pipeline.
import io
import pandas as pd

df_str_2 = """sequence|support
10053,10156,20205|0.94783
10010,10012|0.93322"""
df_2 = pd.read_csv(io.StringIO(df_str_2), sep='|')
# step 1: transform the df 2
# add an id column
df_2["_id"] = df_2.index + 1
# split sequence to list
df_2["sequence"] = df_2.sequence.apply(lambda x: x.split(",") if isinstance(x, str) else [])
# put each item from the list into a new row
trns_df_2 = (
df_2.sequence.apply(pd.Series)
.merge(df_2, right_index=True, left_index=True)
.drop(["sequence"], axis=1)
.melt(id_vars=['support', '_id'], value_name="message_number")
.drop(["variable", "support"], axis=1)
.dropna()
.sort_values("_id", ascending=True)
)
# step 2: merge with df 1
df_str_1 = """event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:41|10053|robot_4
2015-04-15 12:09:44|10156|robot_7
2015-04-15 12:09:47|20205|robot_108
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:59|10011|robot_39"""
df_1 = pd.read_csv(io.StringIO(df_str_1), sep='|')
df_1["message_number"] = df_1.message_number.astype(str)
merged_df = df_1.merge(trns_df_2, on="message_number", how="left")
# keep only the matched (inner) rows and collect the other columns into lists per _id
main_df_inner = (
merged_df[merged_df["_id"].notnull()]
.groupby("_id")
.agg({"event_timestamp": lambda x: list(x),
"message_number": lambda x: list(x),
"an_robot": lambda x: list(x)})
.reset_index()
.drop("_id", axis=1)
)
# collapse each list: keep the first timestamp, join the rest into comma-separated strings
main_df_inner["event_timestamp"] = main_df_inner.event_timestamp.apply(lambda x: x[0])
main_df_inner["message_number"] = main_df_inner.message_number.apply(lambda x: ",".join(x))
main_df_inner["an_robot"] = main_df_inner.an_robot.apply(lambda x: ",".join(x))
# keep only the unmatched (left-only) rows
main_df_left = merged_df[merged_df["_id"].isnull()].drop("_id", axis=1)
# concatenate both parts to make the final df
main_df = pd.concat([main_df_left, main_df_inner])
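As an aside, on pandas 0.25+ the step-1 unnesting can be written more directly with DataFrame.explode (not what this answer uses, just a shorter equivalent):

```python
import pandas as pd

df_2 = pd.DataFrame({'sequence': ['10053,10156,20205', '10010,10012'],
                     'support': [0.94783, 0.93322]})
df_2['_id'] = df_2.index + 1
df_2['sequence'] = df_2['sequence'].str.split(',')

# one row per (sequence element, _id) pair, like trns_df_2 above
trns_df_2 = (df_2.explode('sequence')
             .rename(columns={'sequence': 'message_number'})
             [['_id', 'message_number']]
             .reset_index(drop=True))
```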
What remains is to convert the event_timestamp column to datetime with pd.to_datetime and to sort the dataframe by event_timestamp. I think you can do that yourself.
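For completeness, that final step might look like this (using a small stand-in for main_df; in practice it would be the concatenated result from the pipeline):

```python
import pandas as pd

# stand-in for main_df: three rows, deliberately out of chronological order
main_df = pd.DataFrame({
    'event_timestamp': ['2015-04-15 12:09:51', '2015-04-15 12:09:39',
                        '2015-04-15 12:09:41'],
    'message_number': ['10010,10012', '10125', '10053,10156,20205'],
    'an_robot': ['robot_38,robot_65', 'robot_7', 'robot_4,robot_7,robot_108']})

# parse the strings to real timestamps, then sort chronologically
main_df['event_timestamp'] = pd.to_datetime(main_df['event_timestamp'])
main_df = main_df.sort_values('event_timestamp').reset_index(drop=True)
```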