Joining intervals and points in pandas

Time: 2017-02-01 15:44:26

Tags: python pandas interval-tree

Update 5:

This feature has been released as part of pandas 0.20.1 (on my birthday :])

Update 4:

The PR has been merged!

Update 3:

The PR has moved here

Update 2:

It seems this question may have contributed to re-opening the PR for IntervalIndex in pandas.

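Update

I no longer have this problem, because I am now actually querying for overlapping ranges between A and B, rather than for points from B that fall within ranges in A, which makes it a full interval tree problem. I won't delete this question, because I think it is still a valid question and I don't have a good answer for it.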

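Problem statement

I have two dataframes.

In dataframe A, two integer columns together represent an interval.

In dataframe B, one integer column represents a position.

I want to do a kind of join, so that each point is assigned to the interval(s) it falls within.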

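Intervals rarely overlap, but occasionally they do. If a point falls within such an overlap, it should be assigned to both intervals. About half of the points will not fall within any interval, but nearly every interval will have at least one point within its range.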
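To make the desired semantics concrete, here is a minimal brute-force sketch using NumPy broadcasting. The column names (`start`, `end`, `pos`), the sample values, and the choice of intervals closed on both ends are assumptions made for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names and values are assumptions.
A = pd.DataFrame({"start": [0, 5, 20], "end": [10, 15, 30]})
B = pd.DataFrame({"pos": [3, 7, 12, 17, 25]})

# Compare every interval (rows) against every point (columns).
starts = A["start"].to_numpy()[:, None]
ends = A["end"].to_numpy()[:, None]
pos = B["pos"].to_numpy()[None, :]
mask = (starts <= pos) & (pos <= ends)  # closed on both ends (assumption)

# Row positions in A and B of every (interval, point) match.
a_idx, b_idx = np.nonzero(mask)

# One output row per match; a point inside two intervals appears twice,
# and a point inside no interval does not appear at all.
joined = pd.concat(
    [A.iloc[a_idx].reset_index(drop=True), B.iloc[b_idx].reset_index(drop=True)],
    axis=1,
)
print(joined)
```

This builds a full intervals-by-points boolean matrix, so memory grows with the product of the two lengths. It is only meant to pin down what the join should return, not to be the fast solution the question is after.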

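What I've been thinking

I was initially going to dump my data out of pandas and use intervaltree, banyan, or bx-python, but then I came across this gist. It turns out that the ideas proposed there never made it into pandas, but it got me thinking: it might be possible to do this within pandas. Since I want this code to be as fast as Python can possibly go, I would rather not dump my data out of pandas until the very end. I also get the feeling that this is possible with bins and the pandas cut function, but I am a total newbie to pandas, so I could use some guidance! Thanks!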
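As a rough illustration of the `cut` idea, the sketch below bins points into intervals with `pd.cut` and an `IntervalIndex`. The data and the `pos` / `interval` column names are hypothetical. Note that `pd.cut` requires the `IntervalIndex` bins to be non-overlapping, so this only covers the case where no intervals overlap; it does not satisfy the "assign to both intervals" requirement.

```python
import pandas as pd

# Hypothetical points; the column name is an assumption.
B = pd.DataFrame({"pos": [3, 7, 12, 17, 25]})

# Non-overlapping bins, closed on the right by default: (0, 10] and (20, 30].
bins = pd.IntervalIndex.from_tuples([(0, 10), (20, 30)])

# Each point gets the interval it falls in, or NaN if it falls in none.
B["interval"] = pd.cut(B["pos"], bins=bins)
print(B)
```

Points that fall outside every bin come back as NaN, which matches the expectation that roughly half of the points have no interval.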

Notes

Potentially related? Pandas DataFrame groupby overlapping intervals of variable length

2 answers:

Answer 0: (score: 3)

This feature has been released as part of pandas 0.20.1.
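Assuming the feature meant here is `IntervalIndex` (the updates in the question refer to the IntervalIndex PR), a minimal sketch of the point-in-interval join could look like the following. The data and column names are hypothetical, and the elementwise `IntervalIndex.contains` used below was added in a pandas release later than the one named in this answer, so treat this as a sketch for a reasonably recent pandas rather than the exact API available at the time.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names and values are assumptions.
A = pd.DataFrame({"start": [0, 5, 20], "end": [10, 15, 30]})
B = pd.DataFrame({"pos": [3, 7, 12, 17, 25]})

# Build an interval index from the two columns of A (closed on both ends here).
intervals = pd.IntervalIndex.from_arrays(A["start"], A["end"], closed="both")

# For each point, find the row positions of all intervals that contain it.
pairs = [
    (a_row, b_row)
    for b_row, p in enumerate(B["pos"])
    for a_row in np.flatnonzero(intervals.contains(p))
]

a_idx = [a for a, _ in pairs]
b_idx = [b for _, b in pairs]

# One output row per (interval, point) match.
joined = pd.concat(
    [A.iloc[a_idx].reset_index(drop=True), B.iloc[b_idx].reset_index(drop=True)],
    axis=1,
)
print(joined)
```

The Python-level loop over points keeps the sketch readable but is not tuned for speed; overlapping intervals are handled because each point is tested against every interval.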

Answer 1: (score: 1)

An answer using pyranges, which is basically pandas sprinkled with bioinformatics sugar.

Setup:

import numpy as np
np.random.seed(0)
import pyranges as pr

a = pr.random(int(1e6))
# +--------------+-----------+-----------+--------------+
# | Chromosome   | Start     | End       | Strand       |
# | (category)   | (int32)   | (int32)   | (category)   |
# |--------------+-----------+-----------+--------------|
# | chr1         | 8830650   | 8830750   | +            |
# | chr1         | 9564361   | 9564461   | +            |
# | chr1         | 44977425  | 44977525  | +            |
# | chr1         | 239741543 | 239741643 | +            |
# | ...          | ...       | ...       | ...          |
# | chrY         | 29437476  | 29437576  | -            |
# | chrY         | 49995298  | 49995398  | -            |
# | chrY         | 50840129  | 50840229  | -            |
# | chrY         | 38069647  | 38069747  | -            |
# +--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 1,000,000 rows and 4 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.

b = pr.random(int(1e6), length=1)
# +--------------+-----------+-----------+--------------+
# | Chromosome   | Start     | End       | Strand       |
# | (category)   | (int32)   | (int32)   | (category)   |
# |--------------+-----------+-----------+--------------|
# | chr1         | 52110394  | 52110395  | +            |
# | chr1         | 122640219 | 122640220 | +            |
# | chr1         | 162690565 | 162690566 | +            |
# | chr1         | 117198743 | 117198744 | +            |
# | ...          | ...       | ...       | ...          |
# | chrY         | 45169886  | 45169887  | -            |
# | chrY         | 38863683  | 38863684  | -            |
# | chrY         | 28592193  | 28592194  | -            |
# | chrY         | 29441949  | 29441950  | -            |
# +--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 1,000,000 rows and 4 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.

Execution:

result = a.join(b, strandedness="same")
# +--------------+-----------+-----------+--------------+-----------+-----------+--------------+
# | Chromosome   | Start     | End       | Strand       | Start_b   | End_b     | Strand_b     |
# | (category)   | (int32)   | (int32)   | (category)   | (int32)   | (int32)   | (category)   |
# |--------------+-----------+-----------+--------------+-----------+-----------+--------------|
# | chr1         | 227348436 | 227348536 | +            | 227348516 | 227348517 | +            |
# | chr1         | 18901135  | 18901235  | +            | 18901191  | 18901192  | +            |
# | chr1         | 230131576 | 230131676 | +            | 230131636 | 230131637 | +            |
# | chr1         | 84829850  | 84829950  | +            | 84829903  | 84829904  | +            |
# | ...          | ...       | ...       | ...          | ...       | ...       | ...          |
# | chrY         | 44139791  | 44139891  | -            | 44139821  | 44139822  | -            |
# | chrY         | 51689785  | 51689885  | -            | 51689859  | 51689860  | -            |
# | chrY         | 45379140  | 45379240  | -            | 45379215  | 45379216  | -            |
# | chrY         | 37469479  | 37469579  | -            | 37469576  | 37469577  | -            |
# +--------------+-----------+-----------+--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 16,153 rows and 7 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.

df = result.df
#       Chromosome      Start        End Strand    Start_b      End_b Strand_b
# 0           chr1  227348436  227348536      +  227348516  227348517        +
# 1           chr1   18901135   18901235      +   18901191   18901192        +
# 2           chr1  230131576  230131676      +  230131636  230131637        +
# 3           chr1   84829850   84829950      +   84829903   84829904        +
# 4           chr1  189088140  189088240      +  189088163  189088164        +
# ...          ...        ...        ...    ...        ...        ...      ...
# 16148       chrY   38968068   38968168      -   38968124   38968125        -
# 16149       chrY   44139791   44139891      -   44139821   44139822        -
# 16150       chrY   51689785   51689885      -   51689859   51689860        -
# 16151       chrY   45379140   45379240      -   45379215   45379216        -
# 16152       chrY   37469479   37469579      -   37469576   37469577        -
# 
# [16153 rows x 7 columns]