Question

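I'm confused about how to do this most efficiently with pandas.

I have the following pandas DataFrame, which currently has two columns, start and end, representing the intervals [1, 10], [5, 15], and [3, 8].

import pandas as pd
dict1 = {'start': [1, 5, 3], 'end': [10, 15, 8]}
df = pd.DataFrame(dict1)
print(df)
   start  end
0      1   10
1      5   15
2      3    8

Starting from 0, I want to count how the intervals overlap. This is the correct merged structure (not worrying too much about closed/open intervals):

The interval [0, 1] contains no intervals, [1, 3] has one interval (from [1, 10]), [3, 5] has two intervals (the pair [1, 10] and [3, 8]), [5, 8] has three intervals ([1, 10], [3, 8], [5, 15]), [8, 10] has two intervals ([1, 10], [5, 15]), and so on.

Summarized in tabular form, the expected result would be:

   start  end  total                    interval
0      0    1      0                          []
1      1    3      1                   [[1, 10]]
2      3    5      2           [[1, 10], [3, 8]]
3      5    8      3  [[1, 10], [3, 8], [5, 15]]
4      8   10      2          [[1, 10], [5, 15]]
5     10   15      1                   [[5, 15]]
6     15   75      0                          []

So the interval column is currently a list of lists containing each interval. (I include an integer greater than 15 to show that nothing is there; 75 is arbitrary.)

How should I accomplish the above with pandas? The three steps appear to be:

(1) Break the intervals into segments, given all the other intervals

(2) Count the overlapping intervals

(3) Store the intervals for later retrieval

Is this even possible?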

Answer 1

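I am using NumPy broadcasting. Here df1 is assumed to hold the original intervals and df2 the target segments, each with start and end columns:

# endpoints of the original intervals (df1) and of the segments (df2)
s1=df1.end.values
s2=df1.start.values
s3=df2.end.values
s4=df2.start.values
# after the transpose, f[i, j] is True when original interval j fully covers segment i
f=pd.DataFrame(((s1[:,None]>=s3)&(s2[:,None]<=s4)).T,index=df2.index)
df2['total']=f.sum(1)
# collect the covering intervals for each segment
df2['interval']=[(df1.values[x]).tolist() for x in f.values]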
df2
Out[289]: 
   start  end  total                    interval
0      0    1      0                          []
1      1    3      1                   [[1, 10]]
2      3    5      2           [[1, 10], [3, 8]]
3      5    8      3  [[1, 10], [5, 15], [3, 8]]
4      8   10      2          [[1, 10], [5, 15]]
5     10   15      1                   [[5, 15]]
6     15   75      0                          []
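
The snippet above takes df2 as given. A minimal sketch of one way the segment table could be built from the endpoints of df1 is shown below; the leading 0 and the trailing bound of 75 are assumptions carried over from the question, and the name covers is only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'start': [1, 5, 3], 'end': [10, 15, 8]})

# Sorted unique endpoints, padded with 0 in front and 75 (arbitrary) at the back.
bounds = np.unique(np.concatenate(([0], df1.to_numpy().ravel(), [75])))

# Each pair of consecutive endpoints forms one segment.
df2 = pd.DataFrame({'start': bounds[:-1], 'end': bounds[1:]})

# covers[i, j] is True when original interval j fully covers segment i.
seg_start = df2['start'].to_numpy()[:, None]
seg_end = df2['end'].to_numpy()[:, None]
covers = (df1['start'].to_numpy() <= seg_start) & (df1['end'].to_numpy() >= seg_end)

df2['total'] = covers.sum(axis=1)
print(df2)
```

With df2 built this way, the broadcasting code from the answer can be applied to it unchanged.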

Answer 2

Since pandas 0.24.0 you can use pd.Interval.overlaps:

endpoints = df.stack().sort_values().reset_index(drop=True)
intervals = pd.DataFrame({'start':endpoints.shift().fillna(0), 
                          'end':endpoints}).astype(int)
# construct the list of intervals from the endpoints
intervals['intv'] = [pd.Interval(a,b) for a,b in zip(intervals.start, intervals.end)]

# these are the original intervals
orig_intv = pd.arrays.IntervalArray([pd.Interval(a,b) for a,b in zip(df.start, df.end)])

# walk through the intervals and compute the intersections
intervals['total'] = intervals.intv.apply(lambda x: orig_intv.overlaps(x).sum())

Output:

+----+--------+------+-----------+-------+
|    | start  | end  |   intv    | total |
+----+--------+------+-----------+-------+
| 0  |     0  |   1  | (0, 1]    |     0 |
| 1  |     1  |   3  | (1, 3]    |     1 |
| 2  |     3  |   5  | (3, 5]    |     2 |
| 3  |     5  |   8  | (5, 8]    |     3 |
| 4  |     8  |  10  | (8, 10]   |     2 |
| 5  |    10  |  15  | (10, 15]  |     1 |
+----+--------+------+-----------+-------+
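
This answer produces only the total column. As a rough sketch of how the overlapping partners might also be collected with the same overlaps call, the boolean mask it returns can be used to select rows of the original frame. The helper name partners is hypothetical, and intervals are right-closed here because that is the pandas default.

```python
import pandas as pd

df = pd.DataFrame({'start': [1, 5, 3], 'end': [10, 15, 8]})

# Original intervals as an IntervalArray (closed on the right by default).
orig = pd.arrays.IntervalArray.from_arrays(df['start'].to_numpy(),
                                           df['end'].to_numpy())

def partners(segment):
    """Return the original [start, end] pairs that overlap the given segment."""
    mask = orig.overlaps(segment)  # boolean array, one entry per original interval
    return df[mask].values.tolist()

print(partners(pd.Interval(5, 8)))
```

Applying such a helper to the intv column (for example with intervals.intv.apply(partners)) would give a column comparable to the interval column in the question.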

Answer 3

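Using a standard loop approach:

import numpy as np
import pandas as pd

# all distinct endpoints, sorted; prepend 0 if it is not already a bound
bounds = np.unique(df)
if 0 not in bounds: bounds = np.insert(bounds, 0, 0)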

end = 75
bounds = np.append(bounds, end)

total = []
interval = []
for i in range(len(bounds)-1):
    # Find which intervals fit
    ix = (df['start'] <= bounds[i]) & (df['end'] >= bounds[i+1])

    total.append(np.sum(ix))
    interval.append(df[ix].values.tolist())

pd.DataFrame({'start': bounds[:-1], 'end': bounds[1:], 'total': total, 'interval': interval})
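
If only the counts are needed, a different technique from the loop above is an event sweep: add +1 at each start, -1 at each end, and take a running sum over the sorted endpoints. The following is a sketch under that assumption and is not part of the original answer; the names events and active are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'start': [1, 5, 3], 'end': [10, 15, 8]})

# +1 where an interval opens, -1 where it closes; groupby nets out shared endpoints.
events = pd.concat([
    pd.Series(1, index=df['start'].to_numpy()),
    pd.Series(-1, index=df['end'].to_numpy()),
]).groupby(level=0).sum().sort_index()

# Running sum: number of intervals open on the segment that begins at each endpoint.
active = events.cumsum()
print(active)
```

This avoids comparing every segment with every interval, but it does not record which intervals overlap, so it covers only the counting step.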

How to count overlaps and find overlapping partners in pandas?
