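How can I transfer (sum) counts from a set of ranges to the ranges that contain them?

Time: 2019-03-20 17:35:25

Tags: pandas dataframe bioinformatics

I am working with sequencing data, but I think the problem applies to any kind of range-valued data. I have read counts (values) from several experiments for a set of DNA regions, each with a start and end position (a range), and I want to combine them into summed counts for another set of DNA regions, each of which usually contains several of the original regions. The following example illustrates this.

Given table A with ranges and counts: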

feature start end count1 count2 count3
gene1   1     10  100    30     22
gene2   15    40  20     10     6
gene3   50    70  40     11     7
gene4   100   150 23     15     9

And table B (with the new ranges):

feature  start  end
range1   1      45
range2   55     160

I would like to obtain the following count table for the new ranges:

feature  start  end  count1  count2  count3
range1   1      45   120     40      28
range2   55     160  63      26      16

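To keep things simple, if there is any overlap at all (a feature in table B contains at least part of a feature in table A), the counts should be added. Does anyone know of an existing tool, or a script in Perl, Python or R, that does this? I am using bedtools multicov to count sequencing reads, but as far as I can tell it has no other function that does what I need. Any ideas?

Thanks.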

4 answers:

Answer 0: (score: 1)

You can use apply() and pd.concat() with a custom function, where a is your first dataframe and b is your second:

def find_englobed(x):

    englobed = a[(a['start'].between(x['start'], x['end'])) | (a['end'].between(x['start'], x['end']))]

    return englobed[['count1','count2','count3']].sum()

pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)

Yields:

  feature  start  end  count1  count2  count3
0  range1      1   45     120      40      28
1  range2     55  160      63      26      16
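
One caveat with the filter above: it checks whether a gene's start or end falls inside the new range, so a gene that fully spans a range (it starts before the range and ends after it) would not be counted. A minimal sketch of a more general overlap test follows, assuming closed intervals. The function name sum_overlapping and the extra row range3 are hypothetical and only there to illustrate the spanning case.

```python
import pandas as pd

# Table A from the question.
a = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4"],
    "start": [1, 15, 50, 100],
    "end": [10, 40, 70, 150],
    "count1": [100, 20, 40, 23],
    "count2": [30, 10, 11, 15],
    "count3": [22, 6, 7, 9],
})

# Table B from the question, plus a hypothetical range3 that lies
# entirely inside gene4 (100-150).
b = pd.DataFrame({
    "feature": ["range1", "range2", "range3"],
    "start": [1, 55, 110],
    "end": [45, 160, 120],
})

COUNT_COLS = ["count1", "count2", "count3"]

def sum_overlapping(x):
    # Two closed intervals overlap when each one starts
    # no later than the other one ends.
    overlaps = (a["start"] <= x["end"]) & (a["end"] >= x["start"])
    return a.loc[overlaps, COUNT_COLS].sum()

result = pd.concat([b, b.apply(sum_overlapping, axis=1)], axis=1)
print(result)
```

With this test, range1 and range2 get the same sums as before, and range3 picks up the counts of gene4 (23, 15, 9), which the start/end check would miss.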

Answer 1: (score: 1)

We can do this as follows:

  1. Create an artificial key
  2. Perform an outer join (m x n)
  3. Filter on the start and end values that fall between the ranges of each feature
  4. Use pandas.DataFrame.groupby on feature and sum the count columns
  5. Finally, concat the output to df2 to get the desired output

df1['key'] = 'A'
df2['key'] = 'A'
df3 = pd.merge(df1,df2, on='key', how='outer')
df4 = df3[(df3.start_x.between(df3.start_y, df3.end_y)) | (df3.end_x.between(df3.start_y, df3.end_y))]
df5 = df4.groupby('feature_y').agg({'count1':'sum', 'count2':'sum', 'count3':'sum'}).reset_index()
df_final = pd.concat([df2.drop(['key'], axis=1), df5.drop(['feature_y'], axis=1)], axis=1)
print(df_final)
  feature  start  end  count1  count2  count3
0  range1      1   45     120      40      28
1  range2     55  160      63      26      16
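
The steps above can also be sketched without the artificial key, assuming pandas 1.2 or later, where merge accepts how="cross". This variant merges the sums back on the feature name instead of concatenating by position, so a range with no overlapping gene stays aligned and gets zeros. The row range3 and the names pairs, hits and sums are hypothetical, added only for illustration.

```python
import pandas as pd

# Table A from the question.
df1 = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4"],
    "start": [1, 15, 50, 100],
    "end": [10, 40, 70, 150],
    "count1": [100, 20, 40, 23],
    "count2": [30, 10, 11, 15],
    "count3": [22, 6, 7, 9],
})

# Table B from the question, plus a hypothetical range3 with no overlap.
df2 = pd.DataFrame({
    "feature": ["range1", "range2", "range3"],
    "start": [1, 55, 200],
    "end": [45, 160, 300],
})

count_cols = ["count1", "count2", "count3"]

# Pair every gene with every range (m x n rows).
pairs = df1.merge(df2, how="cross", suffixes=("_gene", "_range"))

# Keep pairs where the gene's start or end falls inside the range.
hits = pairs[
    pairs["start_gene"].between(pairs["start_range"], pairs["end_range"])
    | pairs["end_gene"].between(pairs["start_range"], pairs["end_range"])
]

# Sum the counts per range, then merge back by name rather than by position.
sums = (
    hits.groupby("feature_range", as_index=False)[count_cols]
    .sum()
    .rename(columns={"feature_range": "feature"})
)
df_final = df2.merge(sums, on="feature", how="left")

# Ranges without any hit come back as NaN; turn them into integer zeros.
df_final[count_cols] = df_final[count_cols].fillna(0).astype("int64")
print(df_final)
```

Note that the cross join builds m x n rows in memory, so this approach is best suited to tables of moderate size.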


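Answer 2: (score: 0)

In case it helps someone: building on @rahlf23's answer, I modified it to be more general. On the one hand, there may be more count columns; on the other, besides the ranges, it is important that counts are only summed within the correct chromosome.

So, if table "a" is: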

feature Chromosome  start   end count1  count2  count3
gene1   Chr1        1       10  100     30      22
gene2   Chr1        15      40  20      10      6
gene3   Chr1        50      70  40      11      7
gene4   Chr1        100     150 23      15      9
gene5   Chr2        5       30  24      17      2
gene5   Chr2        40      80  4       28     16

And table "b" is:

feature Chromosome  start   end
range1  Chr1        1       45
range2  Chr1        55      160
range3  Chr2        10      90
range4  Chr2        100     200

With the following Python script:

import pandas as pd

def find_englobed(x):
    englobed = a[(a['Chromosome'] == x['Chromosome']) & (a['start'].between(x['start'], x['end']) | (a['end'].between(x['start'], x['end'])))]
    return englobed[list(a.columns[4:])].sum()

pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)

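Now, with a['Chromosome'] == x['Chromosome'] &, I require the features to be on the same chromosome, and with list(a.columns[4:]) I take every column from the fifth to the last, regardless of how many count columns there are.

I get the following result: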

feature Chromosome  start   end count1  count2  count3
range1  Chr1        1       45  120.0   40.0    28.0
range2  Chr1        55      160 63.0    26.0    16.0
range3  Chr2        10      90  28.0    45.0    18.0
range4  Chr2        100     200 0.0     0.0     0.0

I am not sure why the counts come out as floats. Any comments?
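
The cause of the float output is not established in this thread, and it may depend on the pandas version in use. If integer counts are wanted, one option is to cast the summed columns before concatenating. Because sum() returns 0 rather than NaN for an empty selection (as for range4), the cast should be safe here. A minimal sketch using the tables above follows; the names count_cols and summed are mine, not part of the original answer.

```python
import pandas as pd

# Table "a" from this answer.
a = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4", "gene5", "gene5"],
    "Chromosome": ["Chr1", "Chr1", "Chr1", "Chr1", "Chr2", "Chr2"],
    "start": [1, 15, 50, 100, 5, 40],
    "end": [10, 40, 70, 150, 30, 80],
    "count1": [100, 20, 40, 23, 24, 4],
    "count2": [30, 10, 11, 15, 17, 28],
    "count3": [22, 6, 7, 9, 2, 16],
})

# Table "b" from this answer.
b = pd.DataFrame({
    "feature": ["range1", "range2", "range3", "range4"],
    "Chromosome": ["Chr1", "Chr1", "Chr2", "Chr2"],
    "start": [1, 55, 10, 100],
    "end": [45, 160, 90, 200],
})

# Every column after the first four is treated as a count column.
count_cols = list(a.columns[4:])

def find_englobed(x):
    same_chrom = a["Chromosome"] == x["Chromosome"]
    inside = (a["start"].between(x["start"], x["end"])
              | a["end"].between(x["start"], x["end"]))
    return a.loc[same_chrom & inside, count_cols].sum()

summed = b.apply(find_englobed, axis=1)

# Cast back to integers; harmless if the sums are already integers.
result = pd.concat([b, summed.astype("int64")], axis=1)
print(result)
```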

Answer 3: (score: 0)

If you are doing genomics in pandas, you may want to look into pyranges:

import pyranges as pr

c = """feature Chromosome  Start   End count1  count2  count3
gene1   Chr1        1       10  100     30      22
gene2   Chr1        15      40  20      10      6
gene3   Chr1        50      70  40      11      7
gene4   Chr1        100     150 23      15      9
gene5   Chr2        5       30  24      17      2
gene5   Chr2        40      80  4       28     16
"""

c2 = """feature Chromosome  Start   End
range1  Chr1        1       45
range2  Chr1        55      160
range3  Chr2        10      90
range4  Chr2        100     200 """

gr, gr2 = pr.from_string(c), pr.from_string(c2)

j = gr2.join(gr).drop(like="_b")
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# | feature    | Chromosome   |     Start |       End |    count1 |    count2 |    count3 |
# | (object)   | (category)   |   (int32) |   (int32) |   (int64) |   (int64) |   (int64) |
# |------------+--------------+-----------+-----------+-----------+-----------+-----------|
# | range1     | Chr1         |         1 |        45 |       100 |        30 |        22 |
# | range1     | Chr1         |         1 |        45 |        20 |        10 |         6 |
# | range2     | Chr1         |        55 |       160 |        40 |        11 |         7 |
# | range2     | Chr1         |        55 |       160 |        23 |        15 |         9 |
# | range3     | Chr2         |        10 |        90 |        24 |        17 |         2 |
# | range3     | Chr2         |        10 |        90 |         4 |        28 |        16 |
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# Unstranded PyRanges object has 6 rows and 7 columns from 2 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.

df = j.df

fs = {"Chromosome": "first", "Start":
      "first", "End": "first", "count1": "sum", "count2": "sum", "count3": "sum"}
result = df.groupby("feature".split()).agg(fs)
#         Chromosome  Start  End  count1  count2  count3
# feature
# range1        Chr1      1   45     120      40      28
# range2        Chr1     55  160      63      26      16
# range3        Chr2     10   90      28      45      18