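Conditionally replace rows based on values in another dataframe

Time: 2019-04-14 16:30:10

Tags: python pandas

I have two dataframes, and I want to reduce the information in the first one, which looks like this: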

event_timestamp      message_number  an_robot
2015-04-15 12:09:39  10125            robot_7
2015-04-15 12:09:41  10053            robot_4
2015-04-15 12:09:44  10156            robot_7
2015-04-15 12:09:47  20205            robot_108
2015-04-15 12:09:51  10010            robot_38
2015-04-15 12:09:54  10012            robot_65
2015-04-15 12:09:59  10011            robot_39

The other dataframe looks like this:

sequence             support
10053,10156,20205    0.94783
10010,10012          0.93322

I want to replace every sequence in dataframe 1 that appears in dataframe 2, so the new dataframe should be:

event_timestamp      message_number    an_robot
2015-04-15 12:09:39  10125              robot_7
2015-04-15 12:09:41  10053,10156,20205  robot_4,robot_7,robot_108
2015-04-15 12:09:51  10010,10012        robot_38,robot_65
2015-04-15 12:09:59  10011              robot_39

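Does anyone know how to achieve this? I know how to check whether values match within a single row, but not how to compare multiple rows that must directly follow each other.

---Edit---

Maybe it can be made simpler by generating a new message_number for each sequence. The new dataframe could then be: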

event_timestamp      message_number    an_robot
2015-04-15 12:09:39  10125              robot_7
2015-04-15 12:09:41  1                  robot_4,robot_7,robot_108
2015-04-15 12:09:51  2                  robot_38,robot_65
2015-04-15 12:09:59  10011              robot_39

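Every sequence found in the sequence dataframe would be written as 0, 1, 2, 3, 4, and so on (up to the last sequence). I can always update the database that holds the meanings of the message_number codes with these new numbers. It would be nice to keep the information about which robots were involved, but if that is too complicated, dropping it is fine too.

2 answers:

Answer 0: (score: 1)

I am using unnesting on your df2 (the sequence dataframe, named df1 in the code below), then mapping the rule back to df (the event dataframe) to get a group key, and then using groupby together with agg.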

import numpy as np
import pandas as pd

# Helper: expand list-valued columns so that each list element gets its own row.
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(columns=explode), how='left')

# df1 is the sequence dataframe, df is the event dataframe
df1.sequence=df1.sequence.str.split(',')
s=unnesting(df1,['sequence'])

# map each message number to the index of its sequence; all other rows keep their own number
groupkey=df.message_number.map(dict(zip(s.sequence.astype(int),s.index))).fillna(df.message_number)

df.groupby(groupkey).agg({'event_timestamp':'first','message_number':lambda x : ','.join(x.astype(str)),'an_robot':','.join})
                     event_timestamp  ...                   an_robot
message_number                        ...
0.0              2015-04-15 12:09:41  ...  robot_4,robot_7,robot_108
1.0              2015-04-15 12:09:51  ...          robot_38,robot_65
10011.0          2015-04-15 12:09:59  ...                   robot_39
10125.0          2015-04-15 12:09:39  ...                    robot_7
[4 rows x 3 columns]
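
On pandas 0.25 or later, the built-in `explode` method can probably stand in for the custom unnesting helper. The following is only a minimal sketch under that assumption, using the sample data from the question; the names `events`, `sequences`, `member_to_seq` and `result` are made up for illustration and do not come from the answer. Like the answer, it groups purely on membership in a sequence and does not check that the matching rows are actually consecutive.

```python
import pandas as pd

# Sample data copied from the question; `events` and `sequences` are illustrative names.
events = pd.DataFrame({
    "event_timestamp": [
        "2015-04-15 12:09:39", "2015-04-15 12:09:41", "2015-04-15 12:09:44",
        "2015-04-15 12:09:47", "2015-04-15 12:09:51", "2015-04-15 12:09:54",
        "2015-04-15 12:09:59",
    ],
    "message_number": [10125, 10053, 10156, 20205, 10010, 10012, 10011],
    "an_robot": ["robot_7", "robot_4", "robot_7", "robot_108",
                 "robot_38", "robot_65", "robot_39"],
})
sequences = pd.DataFrame({
    "sequence": ["10053,10156,20205", "10010,10012"],
    "support": [0.94783, 0.93322],
})

# One row per sequence member; the index keeps the id of the originating sequence.
exploded = sequences["sequence"].str.split(",").explode().astype(int)
member_to_seq = dict(zip(exploded, exploded.index))

# Rows that belong to a sequence share that sequence's id; the rest keep their own number.
groupkey = (
    events["message_number"]
    .map(member_to_seq)
    .fillna(events["message_number"])
    .rename("group")
)

# sort=False keeps the groups in order of first appearance (chronological here).
result = (
    events.groupby(groupkey, sort=False)
    .agg(
        event_timestamp=("event_timestamp", "first"),
        message_number=("message_number", lambda s: ",".join(s.astype(str))),
        an_robot=("an_robot", ",".join),
    )
    .reset_index(drop=True)
)
print(result)
```

Because the group key reuses the sequence index (0, 1, ...), it would collide with a real message_number of the same value; offsetting the ids would avoid that if such codes can occur.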

Answer 1: (score: 1)

This may be a bit long if you were hoping for something simpler, but the workflow is clear and reads like a data pipeline.

import io

import pandas as pd

df_str_2 = """sequence|support
10053,10156,20205|0.94783
10010,10012|0.93322"""

df_2 = pd.read_csv(io.StringIO(df_str_2), sep='|')

# step 1: transform df 2
# add an id column
df_2["_id"] = df_2.index + 1
# split each sequence into a list
df_2["sequence"] = df_2.sequence.apply(lambda x: x.split(",") if isinstance(x, str) else [])

# put each item from the list into its own row
trns_df_2 = (
    df_2.sequence.apply(pd.Series)
    .merge(df_2, right_index=True, left_index=True)
    .drop(["sequence"], axis=1)
    .melt(id_vars=['support', '_id'], value_name="message_number")
    .drop(["variable", "support"], axis=1)
    .dropna()
    .sort_values("_id", ascending=True)
)
# step 2: merge with df 1
df_str_1 = """event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:41|10053|robot_4
2015-04-15 12:09:44|10156|robot_7
2015-04-15 12:09:47|20205|robot_108
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:59|10011|robot_39"""

df_1 = pd.read_csv(io.StringIO(df_str_1), sep='|')
df_1["message_number"] = df_1.message_number.astype(str)

merged_df = df_1.merge(trns_df_2, on="message_number", how="left")

# take only the matched rows (the inner-join part) and group them by id, collecting the other columns into lists
main_df_inner = (
    merged_df[merged_df["_id"].notnull()]
    .groupby("_id")
    .agg({"event_timestamp": lambda x: list(x),
          "message_number": lambda x: list(x),
          "an_robot": lambda x: list(x)})
    .reset_index()
    .drop("_id", axis=1)
)

# keep the first timestamp and join the other lists into comma-separated strings
main_df_inner["event_timestamp"] = main_df_inner.event_timestamp.apply(lambda x: x[0])
main_df_inner["message_number"] = main_df_inner.message_number.apply(lambda x: ",".join(x))
main_df_inner["an_robot"] = main_df_inner.an_robot.apply(lambda x: ",".join(x))

# take only the unmatched rows (the left-only part)
main_df_left = merged_df[merged_df["_id"].isnull()].drop("_id", axis=1)

# concatenate both parts to build the final df
main_df = pd.concat([main_df_left, main_df_inner])

What remains is to convert the event_timestamp column to datetime with pd.to_datetime and to sort the dataframe by event_timestamp. I think you can do that yourself.
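
For completeness, that last step might look roughly like the sketch below. The `main_df` built here is only a hand-written stand-in for the result of the `pd.concat` call, so that the snippet runs on its own; the row order mimics what the concatenation would give for the sample data.

```python
import pandas as pd

# Stand-in for the concatenated result; rows are out of time order after pd.concat.
main_df = pd.DataFrame({
    "event_timestamp": ["2015-04-15 12:09:39", "2015-04-15 12:09:59",
                        "2015-04-15 12:09:41", "2015-04-15 12:09:51"],
    "message_number": ["10125", "10011", "10053,10156,20205", "10010,10012"],
    "an_robot": ["robot_7", "robot_39", "robot_4,robot_7,robot_108", "robot_38,robot_65"],
})

# Parse the timestamp strings, then sort chronologically and renumber the rows.
main_df["event_timestamp"] = pd.to_datetime(main_df["event_timestamp"])
main_df = main_df.sort_values("event_timestamp").reset_index(drop=True)
print(main_df)
```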