Question

I have two large datasets that look something like this:

DataFrame df1:

P    Y    p_start   p_stop
p1   y1      7         9
p2   y2      6         7
p3   y3      12        14

DataFrame df2:

T    t_start    t_stop 
t1      5          10
t2      11         15

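I want to check whether each P lies within a region T. If it does, I need to append that row of df1 to the corresponding row of df2. If there are multiple matches, I need to add all of them to the same row. Ideally, my output would look like this:

Desired output: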

T   t_start  t_stop   P_1   Y_1   p_start_1   p_stop_1  P_2  Y_2  p_start_2  p_stop_2
t1     5       10      p1   y1       7           9       p2   y2      6         7
t2     11      15      p3   y3      12          14

My logic is something like the following, but I am not sure how to make it actually work:

for line in df1:
    if df1['p_start'] >= df2['t_start'] & df1['p_end'] <= df2['t_end']:
        df2 = df1.append(['X', 'Y', 'p_start', 'p_stop'])

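I am using column names because I have many more columns that do not need to be appended. I left them out of the sample data for simplicity. I am more concerned with finding the matches and appending them to the correct row of df2.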

Answer 1

Use:

# STEP 1
df3 = df2.assign(key=1).merge(df1.assign(key=1), on='key').drop(columns='key')

# STEP 2
df3 = df3[df3['t_start'].lt(df3['p_start']) & df3['t_stop'].gt(df3['p_stop'])]

# STEP 3
df3 = df3.melt(['T', 't_start', 't_stop'])

# STEP 4
df3['variable'] += '_' + df3.groupby(['T', 't_start', 't_stop', 'variable']).cumcount().add(1).astype(str)
    
# STEP 5
df3 = (
    df3.set_index(['T', 't_start', 't_stop', 'variable'])
    .unstack().droplevel(0, axis=1).rename_axis(columns=None).reset_index()
)
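For reference, the steps above can be wrapped into one self-contained sketch that also runs on the sample data from the question. The helper name `attach_matches`, its `item_cols` parameter, and the `extra` column are hypothetical and only illustrate the asker's remark that df1 has further columns that should not be appended. The sketch assumes every column of df2 identifies a region, and it keeps the strict comparisons used in STEP 2.

```python
import pandas as pd


def attach_matches(regions, items, item_cols):
    """Append each row of `items` lying strictly inside a region to that region's row.

    Hypothetical helper; assumes the column names t_start/t_stop and p_start/p_stop.
    """
    id_cols = list(regions.columns)  # assumption: all region columns are identifiers
    # STEP 1: cross join on a temporary constant key, using only the wanted item columns
    pairs = (
        regions.assign(key=1)
        .merge(items[item_cols].assign(key=1), on="key")
        .drop(columns="key")
    )
    # STEP 2: keep strict containment (switch to .le / .ge for inclusive bounds)
    inside = pairs[pairs["t_start"].lt(pairs["p_start"]) & pairs["t_stop"].gt(pairs["p_stop"])]
    # STEP 3: melt the item columns into rows
    long = inside.melt(id_cols)
    # STEP 4: number the matches within each region, e.g. P_1, P_2
    long["variable"] += "_" + long.groupby(id_cols + ["variable"]).cumcount().add(1).astype(str)
    # STEP 5: pivot the numbered names back into columns
    return (
        long.set_index(id_cols + ["variable"])
        .unstack()
        .droplevel(0, axis=1)
        .rename_axis(columns=None)
        .reset_index()
    )


df1 = pd.DataFrame({
    "P": ["p1", "p2", "p3"],
    "Y": ["y1", "y2", "y3"],
    "p_start": [7, 6, 12],
    "p_stop": [9, 7, 14],
    "extra": [1, 2, 3],  # hypothetical column that should not be appended
})
df2 = pd.DataFrame({"T": ["t1", "t2"], "t_start": [5, 11], "t_stop": [10, 15]})

result = attach_matches(df2, df1, ["P", "Y", "p_start", "p_stop"])
print(result)
```

Note that the cross join materialises len(df1) * len(df2) rows before filtering, which may matter for genuinely large inputs.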

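Explanation / steps:

Step 1: Use DataFrame.merge to merge the two dataframes on a temporary common column, key. The merge creates every possible combination of rows from the two dataframes, so that the rows satisfying the condition can be filtered in STEP 2.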

# STEP 1
    T  t_start  t_stop   P   Y  p_start  p_stop
0  t1        5      10  p1  y1        7       9
1  t1        5      10  p2  y2        6       7
2  t1        5      10  p3  y3       12      14
3  t2       11      15  p1  y1        7       9
4  t2       11      15  p2  y2        6       7
5  t2       11      15  p3  y3       12      14
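As a side note, pandas 1.2 and later can build the same cross product without the temporary key column by passing how='cross' to DataFrame.merge. The sketch below is not part of the original answer; it only compares the two spellings on the sample data, and the names `via_key` and `via_cross` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "P": ["p1", "p2", "p3"],
    "Y": ["y1", "y2", "y3"],
    "p_start": [7, 6, 12],
    "p_stop": [9, 7, 14],
})
df2 = pd.DataFrame({"T": ["t1", "t2"], "t_start": [5, 11], "t_stop": [10, 15]})

# Temporary-key approach from STEP 1
via_key = df2.assign(key=1).merge(df1.assign(key=1), on="key").drop(columns="key")

# Built-in cross join (requires pandas >= 1.2)
via_cross = df2.merge(df1, how="cross")

print(via_cross)
```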

Step 2: Filter the rows of the merged dataframe df3 so that p_start is greater than t_start and t_stop is greater than p_stop, i.e. p_start and p_stop lie within the region between t_start and t_stop.

# STEP 2
    T  t_start  t_stop   P   Y  p_start  p_stop
0  t1        5      10  p1  y1        7       9
1  t1        5      10  p2  y2        6       7
5  t2       11      15  p3  y3       12      14

Step 3: Use DataFrame.melt to melt the dataframe, which converts the columns P, Y, p_start, p_stop into rows.

# STEP 3
     T  t_start  t_stop variable value
0   t1        5      10        P    p1
1   t1        5      10        P    p2
2   t2       11      15        P    p3
3   t1        5      10        Y    y1
4   t1        5      10        Y    y2
5   t2       11      15        Y    y3
6   t1        5      10  p_start     7
7   t1        5      10  p_start     6
8   t2       11      15  p_start    12
9   t1        5      10   p_stop     9
10  t1        5      10   p_stop     7
11  t2       11      15   p_stop    14

Step 4: Use DataFrame.groupby on the given columns with the cumcount transformation, and append the result to the variable column. This adds a sequential counter to each name in variable.

# STEP 4
     T  t_start  t_stop   variable value
0   t1        5      10        P_1    p1
1   t1        5      10        P_2    p2
2   t2       11      15        P_1    p3
3   t1        5      10        Y_1    y1
4   t1        5      10        Y_2    y2
5   t2       11      15        Y_1    y3
6   t1        5      10  p_start_1     7
7   t1        5      10  p_start_2     6
8   t2       11      15  p_start_1    12
9   t1        5      10   p_stop_1     9
10  t1        5      10   p_stop_2     7
11  t2       11      15   p_stop_1    14

Step 5: Use set_index together with DataFrame.unstack to reshape the dataframe, pivoting the items in the variable column into separate columns.

# STEP 5
    T  t_start  t_stop P_1  P_2 Y_1  Y_2 p_start_1 p_start_2 p_stop_1 p_stop_2
0  t1        5      10  p1   p2  y1   y2         7         6        9        7
1  t2       11      15  p3  NaN  y3  NaN        12       NaN       14      NaN

Step 6: An optional step, if you want to reorder the columns in the dataframe.
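The answer does not show code for the optional reordering step. One possible sketch, assuming the goal is the column order from the question's desired output (all fields of match 1, then all fields of match 2), is below. The frame `wide` is rebuilt by hand from the STEP 5 output so the example is self-contained, and the names `base_order` and `sort_key` are illustrative.

```python
import pandas as pd

# Hand-built copy of the STEP 5 result
wide = pd.DataFrame({
    "T": ["t1", "t2"],
    "t_start": [5, 11],
    "t_stop": [10, 15],
    "P_1": ["p1", "p3"],
    "P_2": ["p2", None],
    "Y_1": ["y1", "y3"],
    "Y_2": ["y2", None],
    "p_start_1": [7, 12],
    "p_start_2": [6, None],
    "p_stop_1": [9, 14],
    "p_stop_2": [7, None],
})

id_cols = ["T", "t_start", "t_stop"]
base_order = ["P", "Y", "p_start", "p_stop"]  # desired order within each match


def sort_key(col):
    # Split "p_start_2" into ("p_start", "2"); sort by match number, then field order
    base, _, number = col.rpartition("_")
    return int(number), base_order.index(base)


value_cols = sorted((c for c in wide.columns if c not in id_cols), key=sort_key)
reordered = wide[id_cols + value_cols]
print(reordered.columns.tolist())
```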
