Question

我有一个pandas数据框，看起来如下：

  chrom  start  end  probability   read
0  chr1      1   10         0.99  read1
1  chr1      5   25         0.99  read2
2  chr1     15   25         0.99  read2
3  chr1     30   40         0.75  read4

我想做的是合并具有相同染色体（chrom列）的区间，其坐标（开始，结束）重叠。在某些情况下，如果多个区间相互重叠，则会有合并的区间，即使它们不重叠。请参阅上述示例中的第0行和第2行以及下面合并的输出

对于那些合并的元素，我想总结它们的概率（概率列）并计算“读取”列中的唯一元素。

使用上面的示例会导致以下输出，请注意行0,1和2已合并：

 chrom  start  end  probability  read
0  chr1      1   20         2.97     2
1  chr1     30   40         0.75     1

到目前为止，我一直在使用pybedtools merge进行此操作，但事实证明这样做的速度是数百万次（我的情况）。因此，我正在寻找其他选择，熊猫是显而易见的。我知道，对于pandas groupby ，可以对要合并的列应用不同的操作，如 nunique 和 sum ，这些是我需要申请。尽管如此，pandas groupby仅将数据与精确的'chrom'，'start'和'end'坐标合并。

我的问题是我不知道如何使用pandas根据坐标（chrom，start，end）合并我的行，然后应用 sum 和 nunique 操作。

有这么快的方法吗？

谢谢！

PS：正如我在问题上所说，我这样做了数百万次，所以速度是个大问题。因此，我无法使用pybedtools或纯python，这对我的目标来说太慢了。

谢谢！

Answer 1

如@root所建议，接受的答案无法推广到类似情况。例如如果我们在问题的示例中添加额外的一行，范围为2-3，则为：

df = pd.DataFrame({'chrom': ['chr1','chr1','chr1','chr1','chr1'], 
    'start': [1, 2, 5, 15, 30],
    'end': [10, 3, 20, 25, 40],
    'probability': [0.99, 0.99, 0.99, 0.99, 0.75],
    'read': ['read1','read2','read2','read2','read4']})

...建议的聚合函数输出以下数据帧。请注意，4的范围是1-10，但不再捕获。 1-10、2-3、5-20和15-25的范围都重叠，因此应归为一类。

一种解决方案是以下方法（使用@ W-B建议的聚合函数和组合间隔posted by @CentAu的方法）。

# Union intervals by @CentAu
from sympy import Interval, Union
def union(data):
    """ Union of a list of intervals e.g. [(1,2),(3,4)] """
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) \
       else list(u.args)

# Get intervals for rows
def f(x,position=None):
    """
    Returns an interval for the row. The start and stop position indicate the minimum
        and maximum position of all overlapping ranges within the group.
    Args: 
        position (str, optional): Returns an integer indicating start or stop position.
    """
    intervals = union(x)
    if position and position.lower() == 'start':
        group = x.str[0].apply(lambda y: [l.start for g,l in enumerate(intervals) if l.contains(y)][0])
    elif position and position.lower() == 'end':
        group = x.str[0].apply(lambda y: [l.end for g,l in enumerate(intervals) if l.contains(y)][0])
    else:
        group = x.str[0].apply(lambda y: [l for g,l in enumerate(intervals) if l.contains(y)][0])
    return group

# Combine start and end into a single column
df['start_end'] = df[['start', 'end']].apply(list, axis=1)

# Assign each row to an interval and add start/end columns
df['start_interval'] = df[['chrom',
    'start_end']].groupby(['chrom']).transform(f,'start')
df['end_interval'] = df[['chrom',
    'start_end']].groupby(['chrom']).transform(f,'end')

# Aggregate rows, using approach by @W-B
df.groupby(['chrom','start_interval','end_interval']).agg({'probability':'sum',
'read':'nunique'}).reset_index()

...将输出以下数据帧。第一行的总和为3.96，因为我们要合并四行而不是三行。

虽然这种方法应该更通用，但并不一定很快！希望其他人可以提出更快的替代方案。

Answer 2

IIUC

df.groupby((df.end.shift()-df.start).lt(0).cumsum()).agg({'chrom':'first','start':'first','end':'last','probability':'sum','read':'nunique'})
Out[417]: 
  chrom  start  end  probability  read
0  chr1      1   20         2.97     2
1  chr1     30   40         0.75     1

更多信息创建组密钥

(df.end.shift()-df.start).lt(0).cumsum()
Out[418]: 
0    0
1    0
2    0
3    1
dtype: int32

Answer 3

以下是使用pyranges和熊猫的答案。改进之处在于，它可以真正快速地进行合并，即使在单核模式下也易于并行化并且超级快速。

设置：

import pandas as pd
import pyranges as pr
import numpy as np

rows = int(1e7)
gr = pr.random(rows)
gr.probability = np.random.rand(rows)
gr.read = np.arange(rows)
print(gr)

# +--------------+-----------+-----------+--------------+----------------------+-----------+
# | Chromosome   | Start     | End       | Strand       | probability          | read      |
# | (category)   | (int32)   | (int32)   | (category)   | (float64)            | (int64)   |
# |--------------+-----------+-----------+--------------+----------------------+-----------|
# | chr1         | 149953099 | 149953199 | +            | 0.7536048547309669   | 0         |
# | chr1         | 184344435 | 184344535 | +            | 0.9358130407479777   | 1         |
# | chr1         | 238639916 | 238640016 | +            | 0.024212603310159064 | 2         |
# | chr1         | 95180042  | 95180142  | +            | 0.027139751993808026 | 3         |
# | ...          | ...       | ...       | ...          | ...                  | ...       |
# | chrY         | 34355323  | 34355423  | -            | 0.8843190383030953   | 999996    |
# | chrY         | 1818049   | 1818149   | -            | 0.23138017743097572  | 999997    |
# | chrY         | 10101456  | 10101556  | -            | 0.3007915302642412   | 999998    |
# | chrY         | 355910    | 356010    | -            | 0.03694752911338561  | 999999    |
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# Stranded PyRanges object has 1,000,000 rows and 6 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.

执行：

def praderas(df):
    grpby = df.groupby("Cluster")
    prob = grpby.probability.sum()
    prob.name = "ProbSum"
    n = grpby.read.count()
    n.name = "Count"

    return df.merge(prob, on="Cluster").merge(n, on="Cluster")

%time result = gr.cluster().apply(praderas)
# 11.4s !
result[result.Count > 2]
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# | Chromosome   | Start     | End       | Strand       | probability          | read      | Cluster   | ProbSum            | Count     |
# | (category)   | (int32)   | (int32)   | (category)   | (float64)            | (int64)   | (int32)   | (float64)          | (int64)   |
# |--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------|
# | chr1         | 52952     | 53052     | +            | 0.7411051557901921   | 59695     | 70        | 2.2131010082513884 | 3         |
# | chr1         | 52959     | 53059     | +            | 0.9979036360671423   | 356518    | 70        | 2.2131010082513884 | 3         |
# | chr1         | 53029     | 53129     | +            | 0.47409221639405397  | 104776    | 70        | 2.2131010082513884 | 3         |
# | chr1         | 64657     | 64757     | +            | 0.32465233067499366  | 386140    | 88        | 1.3880589602361695 | 3         |
# | ...          | ...       | ...       | ...          | ...                  | ...       | ...       | ...                | ...       |
# | chrY         | 59356855  | 59356955  | -            | 0.3877207561218887   | 9966373   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356865  | 59356965  | -            | 0.4007557656399032   | 9907364   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356932  | 59357032  | -            | 0.33799123310907786  | 9978653   | 8502533   | 1.182153891322546  | 4         |
# | chrY         | 59356980  | 59357080  | -            | 0.055686136451676305 | 9994845   | 8502533   | 1.182153891322546  | 4         |
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# Stranded PyRanges object has 606,212 rows and 9 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.

pandas按范围合并间隔

3 个答案: