Question

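I want to merge two DataFrames on the name and depth columns. The left df has a single depth column ('depth'). The right df, however, has two depth columns ('top_depth' and 'bottom_depth').

I want to take each record from the left df and, where one is available, assign it a record from the right df whose 'top_depth' and 'bottom_depth' the 'depth' falls between.

I've put together some simple DataFrames: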

df1 = pd.DataFrame(np.array([
    ['b1', 4, 9],
    ['b1', 5, 61],
    ['b1', 15, 95],
    ['b1', 24, 9],
    ['b2', 4, 5],
    ['b2', 6, 6],
    ['b2', 44, 0]]),
    columns=['name', 'depth', 'attr1'])
df2 = pd.DataFrame(np.array([
    ['b1', 1, 6, 66],
    ['b1', 14, 16, 99],
    ['b1', 51, 55, 9],
    ['b3', 0, 5, 32]]),
    columns=['name', 'top_depth', 'bottom_depth', 'attr2'])

and would then like to merge them to get this:

>>> df3
  name depth top_depth bottom_depth attr1 attr2
0   b1   4.0       1.0          6.0   9.0  66.0
1   b1   5.0       1.0          6.0  61.0  66.0
2   b1  15.0      14.0         16.0  95.0  99.0
3   b1    24       NaN          NaN     9   NaN
4   b2     4       NaN          NaN     5   NaN
5   b2     6       NaN          NaN     6   NaN
6   b2    44       NaN          NaN     0   NaN

I'm sure I could find a brute-force way to do this, but there must be a better, more pandas-like way.

Answer 1

You can join (on the index):

In [11]: df1.join(df2, how='outer', rsuffix='_')
Out[11]:
  name depth attr1 name_ top_depth bottom_depth attr2
0   b1     4     9    b1         1            6    66
1   b1     5    61    b1        14           16    99
2   b1    15    95    b1        51           55     9
3   b1    24     9    b3         0            5    32
4   b2     4     5   NaN       NaN          NaN   NaN
5   b2     6     6   NaN       NaN          NaN   NaN
6   b2    44     0   NaN       NaN          NaN   NaN

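Note: rsuffix is needed because the name columns don't match... it's not clear what you want to do in that case.

Note: np.array forces the array to share a single (initial?) dtype, which in this case means all the numbers are strings. You can pass plain Python lists to DataFrame instead!

Here's a slightly inefficient approach. First, define a function that looks up the name and checks whether the depth lies between top and bottom: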
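To illustrate the dtype note, here is a minimal sketch comparing the two constructions on a few of the question's rows. The variable names `rows`, `as_array`, and `as_list` are made up for this example.

```python
import numpy as np
import pandas as pd

# A few rows from the question's df1
rows = [['b1', 4, 9], ['b1', 5, 61], ['b2', 44, 0]]

# np.array coerces every value to one dtype, so the numbers become strings
as_array = pd.DataFrame(np.array(rows), columns=['name', 'depth', 'attr1'])

# A plain list of lists lets pandas infer a dtype per column
as_list = pd.DataFrame(rows, columns=['name', 'depth', 'attr1'])

print(as_array['depth'].tolist())  # string values
print(as_list.dtypes)              # depth and attr1 are integer columns
```

With numeric columns, comparisons such as `depth > top_depth` behave as numeric comparisons rather than string comparisons.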

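def get_depth_group(name, depth):
    # Boolean mask: df2 rows with the same name whose interval contains depth
    arr = (df2.name == name) & (df2.bottom_depth > depth) & (depth > df2.top_depth)
    # Return the first matching row of df2, or NaN when nothing matches
    return df2.iloc[arr.argmax()] if arr.any() else np.nan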

It might be more efficient to use a different data structure for this... but this will work!

In [21]: df1[['depth', 'attr1']].join(df1.apply(lambda x: get_depth_group(x['name'], x['depth']), axis=1))
Out[21]:
   depth  attr1 name  top_depth  bottom_depth  attr2
0      4      9   b1          1             6     66
1      5     61   b1          1             6     66
2     15     95   b1         14            16     99
3     24      9  NaN        NaN           NaN    NaN
4      4      5  NaN        NaN           NaN    NaN
5      6      6  NaN        NaN           NaN    NaN
6     44      0  NaN        NaN           NaN    NaN

Answer 2

A partial answer:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([
    ['b1', 4, 9],
    ['b1', 5, 61],
    ['b1', 15, 95],
    ['b1', 24, 9],
    ['b2', 4, 5],
    ['b2', 6, 6],
    ['b2', 44, 0]]),
    columns=['name', 'depth', 'attr1'])
df2 = pd.DataFrame(np.array([
    ['b1', 1, 6, 66],
    ['b1', 14, 16, 99],
    ['b1', 51, 55, 9],
    ['b3', 0, 5, 32]]),
    columns=['name', 'top_depth', 'bottom_depth', 'attr2'])

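# Ordered outer merge on the shared 'name' column
# (pd.merge_ordered was called pd.ordered_merge before pandas 0.19)
om = pd.merge_ordered(df2, df1)
# np.array left every value as a string, so convert the numeric columns
num_cols = ['top_depth', 'bottom_depth', 'attr2', 'depth', 'attr1']
om[num_cols] = om[num_cols].apply(pd.to_numeric)
sandwiched = om.query('(depth > top_depth) & (depth <= bottom_depth)')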

sandwiched is:

  name  top_depth  bottom_depth  attr2  depth  attr1
0   b1          1             6     66      4      9
1   b1          1             6     66      5     61
6   b1         14            16     99     15     95

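I think you can use a join to bring back the rest of df1, but I can't remember how.

It might not be a SQL-shaped problem after all. Can you assume the frames are sorted by depth and top_depth? Do the df2 ranges overlap? Iterating over each DataFrame once might be the efficient way to do it.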
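One possible way to bring back the rest of df1 is sketched below. It is not necessarily what the answerer had in mind. It assumes the frames are built from plain lists (so the columns are numeric) and that the df2 ranges for a given name do not overlap. The names `pairs`, `matched`, and `df3` are chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['b1', 4, 9], ['b1', 5, 61], ['b1', 15, 95], ['b1', 24, 9],
     ['b2', 4, 5], ['b2', 6, 6], ['b2', 44, 0]],
    columns=['name', 'depth', 'attr1'])
df2 = pd.DataFrame(
    [['b1', 1, 6, 66], ['b1', 14, 16, 99], ['b1', 51, 55, 9], ['b3', 0, 5, 32]],
    columns=['name', 'top_depth', 'bottom_depth', 'attr2'])

# Pair every df1 row with every df2 row of the same name, remembering
# the original df1 row label in the 'index' column.
pairs = df1.reset_index().merge(df2, on='name')

# Keep only the pairs where depth lies inside the interval.
sandwiched = pairs.query('(depth > top_depth) & (depth <= bottom_depth)')

# Join the matches back onto df1 by row label; unmatched rows get NaN.
matched = sandwiched.set_index('index')[['top_depth', 'bottom_depth', 'attr2']]
df3 = df1.join(matched)
print(df3)
```

If the ranges did overlap, a df1 row could match more than one interval and would appear more than once in `df3`.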
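If both assumptions hold (the data can be sorted, and the df2 ranges for one name never overlap), pandas' `merge_asof` can do the sorted single-pass matching. The following is a sketch under those assumptions, not a general solution. The names `left`, `right`, `cand`, and `df3` are chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['b1', 4, 9], ['b1', 5, 61], ['b1', 15, 95], ['b1', 24, 9],
     ['b2', 4, 5], ['b2', 6, 6], ['b2', 44, 0]],
    columns=['name', 'depth', 'attr1'])
df2 = pd.DataFrame(
    [['b1', 1, 6, 66], ['b1', 14, 16, 99], ['b1', 51, 55, 9], ['b3', 0, 5, 32]],
    columns=['name', 'top_depth', 'bottom_depth', 'attr2'])

# merge_asof requires both frames to be sorted by their merge keys.
left = df1.reset_index().sort_values('depth')
right = df2.sort_values('top_depth')

# For each depth, pick the nearest interval of the same name whose
# top_depth is strictly below it (the only candidate if ranges don't overlap).
cand = pd.merge_asof(left, right, left_on='depth', right_on='top_depth',
                     by='name', allow_exact_matches=False)

# Discard candidates whose interval ends before the depth is reached.
miss = ~(cand['depth'] <= cand['bottom_depth'])
cand.loc[miss, ['top_depth', 'bottom_depth', 'attr2']] = float('nan')

# Restore the original df1 row order.
df3 = cand.set_index('index').sort_index()
print(df3)
```

This follows the `(depth > top_depth) & (depth <= bottom_depth)` convention used in the query above.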
