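I am working with sequencing data, but I think the problem applies to other kinds of range-value data. I want to combine read counts (values) from several experiments, measured over a set of DNA regions with start and end positions (ranges), into summed counts over another set of DNA regions (which usually contain several of the original regions). As in the following example:
Given the following table A with ranges and counts: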
feature start end count1 count2 count3
gene1 1 10 100 30 22
gene2 15 40 20 10 6
gene3 50 70 40 11 7
gene4 100 150 23 15 9
and the following table B (with new ranges):
feature start end
range1 1 45
range2 55 160
I want to obtain the following count table with the new ranges:
feature start end count1 count2 count3
range1 1 45 120 40 28
range2 55 160 63 26 16
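To keep it simple, if there is at least some overlap (a feature in table B contains at least part of a feature in table A), the counts should be added up. Do you know of an available tool, or a script in Perl, Python or R? I am using bedtools multicov to count sequencing reads, but as far as I have searched, no other function does what I need. Any ideas?
Thanks.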
Answer 0 (score: 1)
You can use apply() and pd.concat() with a custom function, where a corresponds to your first dataframe and b to your second dataframe:
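import pandas as pd

def find_englobed(x):
    # rows of a whose start or end falls inside the range given by row x of b
    englobed = a[(a['start'].between(x['start'], x['end'])) | (a['end'].between(x['start'], x['end']))]
    return englobed[['count1','count2','count3']].sum()

# sum the counts for every range in b and append them as new columns
pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)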
Yields:
feature start end count1 count2 count3
0 range1 1 45 120 40 28
1 range2 55 160 63 26 16
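One case the between() test above does not cover is a feature in a that completely contains the new range: neither its start nor its end then falls inside the range. A minimal sketch of a more general closed-interval overlap test follows. The name sum_overlapping and the small "inner" range are made up for illustration, and the data is copied from the question.

```python
import pandas as pd

# Table A and table B from the question
a = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4"],
    "start": [1, 15, 50, 100],
    "end": [10, 40, 70, 150],
    "count1": [100, 20, 40, 23],
    "count2": [30, 10, 11, 15],
    "count3": [22, 6, 7, 9],
})
b = pd.DataFrame({
    "feature": ["range1", "range2"],
    "start": [1, 55],
    "end": [45, 160],
})

def sum_overlapping(x):
    # Two closed intervals overlap when each one starts before the other ends
    hits = a[(a["start"] <= x["end"]) & (a["end"] >= x["start"])]
    return hits[["count1", "count2", "count3"]].sum()

result = pd.concat([b, b.apply(sum_overlapping, axis=1)], axis=1)
print(result)

# Hypothetical range lying strictly inside gene1 (1-10): the between() test
# would find no rows here, while the overlap test still picks up gene1
inner = pd.Series({"start": 3, "end": 8})
print(sum_overlapping(inner))
```

For the two ranges in the question this gives the same sums as the answer above; the difference only shows up for ranges that sit entirely inside a feature.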
Answer 1 (score: 1)
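We can do this as follows:
1. Add a key column to both dataframes
2. Do an outer join on key (m x n)
3. Filter on start or end values that fall between the ranges
4. Use pandas.DataFrame.groupby on feature and sum the count columns
5. concat the output onto df2 to get the desired output
df1['key'] = 'A'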
df2['key'] = 'A'
df3 = pd.merge(df1,df2, on='key', how='outer')
df4 = df3[(df3.start_x.between(df3.start_y, df3.end_y)) | (df3.end_x.between(df3.start_y, df3.end_y))]
df5 = df4.groupby('feature_y').agg({'count1':'sum',
'count2':'sum',
'count3':'sum'}).reset_index()
df_final = pd.concat([df2.drop(['key'], axis=1), df5.drop(['feature_y'], axis=1)], axis=1)
print(df_final)
Output:
feature start end count1 count2 count3
0 range1 1 45 120 40 28
1 range2 55 160 63 26 16
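The final concat in this answer lines rows up by position, so it relies on every range having at least one match and on the groupby order agreeing with the order of df2. A minimal sketch of a variant that joins the sums back by feature name instead is shown below. It keeps the same between() filter; range3 is a hypothetical extra range added only to show a range with no overlap, and the column suffixes _a and _b are my own choice.

```python
import pandas as pd

df1 = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4"],
    "start": [1, 15, 50, 100],
    "end": [10, 40, 70, 150],
    "count1": [100, 20, 40, 23],
    "count2": [30, 10, 11, 15],
    "count3": [22, 6, 7, 9],
})
# range3 is hypothetical: it overlaps none of the genes
df2 = pd.DataFrame({
    "feature": ["range1", "range2", "range3"],
    "start": [1, 55, 200],
    "end": [45, 160, 300],
})
count_cols = ["count1", "count2", "count3"]

# Every gene paired with every range through a constant helper key (m x n rows)
pairs = pd.merge(df1.assign(key=0), df2.assign(key=0), on="key", suffixes=("_a", "_b"))

# Keep pairs where the gene's start or end falls inside the range
inside = (pairs["start_a"].between(pairs["start_b"], pairs["end_b"])
          | pairs["end_a"].between(pairs["start_b"], pairs["end_b"]))
sums = pairs[inside].groupby("feature_b")[count_cols].sum()

# Join by feature name rather than by row position; unmatched ranges get 0
df_final = df2.merge(sums, how="left", left_on="feature", right_index=True)
df_final[count_cols] = df_final[count_cols].fillna(0).astype(int)
print(df_final)
```

With a positional concat, range3 would either be dropped or end up next to another range's sums; the left merge keeps it with zero counts.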
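Answer 2 (score: 0)
In case it helps someone: starting from @rahlf23's answer, I modified it to be more general, since on the one hand there can be more count columns, and on the other hand, besides the range, it is also important that counts stay on the correct chromosome.
So, if table "a" is: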
feature Chromosome start end count1 count2 count3
gene1 Chr1 1 10 100 30 22
gene2 Chr1 15 40 20 10 6
gene3 Chr1 50 70 40 11 7
gene4 Chr1 100 150 23 15 9
gene5 Chr2 5 30 24 17 2
gene5 Chr2 40 80 4 28 16
and table "b" is:
feature Chromosome start end
range1 Chr1 1 45
range2 Chr1 55 160
range3 Chr2 10 90
range4 Chr2 100 200
Using the following Python script:
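import pandas as pd

def find_englobed(x):
    # same chromosome, and start or end of the feature inside the range of row x
    englobed = a[(a['Chromosome'] == x['Chromosome']) & (a['start'].between(x['start'], x['end']) | (a['end'].between(x['start'], x['end'])))]
    # sum every column from the fifth onwards, i.e. all count columns
    return englobed[list(a.columns[4:])].sum()

pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)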
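Now, with (a['Chromosome'] == x['Chromosome']) & I require them to be on the same chromosome, and with list(a.columns[4:]) I get all columns from the fifth to the last, regardless of the number of count columns.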
I get the following result:
feature Chromosome start end count1 count2 count3
range1 Chr1 1 45 120.0 40.0 28.0
range2 Chr1 55 160 63.0 26.0 16.0
range3 Chr2 10 90 28.0 45.0 18.0
range4 Chr2 100 200 0.0 0.0 0.0
I am not sure why the counts come out as floats. Any comments?
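On the float question, one plausible explanation (an assumption, not checked against the poster's pandas version) is that range4 matches no rows, and a sum over an empty selection can come back as a float, which then upcasts the whole column when the rows are combined. Whatever the cause, a minimal sketch of a fix is to cast the count columns back to integers after the concat. The data below is copied from the tables above; count_cols is a helper name introduced here.

```python
import pandas as pd

a = pd.DataFrame({
    "feature": ["gene1", "gene2", "gene3", "gene4", "gene5", "gene5"],
    "Chromosome": ["Chr1", "Chr1", "Chr1", "Chr1", "Chr2", "Chr2"],
    "start": [1, 15, 50, 100, 5, 40],
    "end": [10, 40, 70, 150, 30, 80],
    "count1": [100, 20, 40, 23, 24, 4],
    "count2": [30, 10, 11, 15, 17, 28],
    "count3": [22, 6, 7, 9, 2, 16],
})
b = pd.DataFrame({
    "feature": ["range1", "range2", "range3", "range4"],
    "Chromosome": ["Chr1", "Chr1", "Chr2", "Chr2"],
    "start": [1, 55, 10, 100],
    "end": [45, 160, 90, 200],
})

# All columns from the fifth onwards are treated as count columns
count_cols = list(a.columns[4:])

def find_englobed(x):
    englobed = a[(a["Chromosome"] == x["Chromosome"])
                 & (a["start"].between(x["start"], x["end"])
                    | a["end"].between(x["start"], x["end"]))]
    return englobed[count_cols].sum()

result = pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)

# Sums of integer counts are whole numbers, so the cast loses nothing
result[count_cols] = result[count_cols].astype("int64")
print(result)
```

The cast is safe here because the sums contain no missing values; if a step introduced NaN, it would need a fillna(0) first.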
Answer 3 (score: 0)
If you are doing genomics in pandas, you may want to look into pyranges:
import pyranges as pr
c = """feature Chromosome Start End count1 count2 count3
gene1 Chr1 1 10 100 30 22
gene2 Chr1 15 40 20 10 6
gene3 Chr1 50 70 40 11 7
gene4 Chr1 100 150 23 15 9
gene5 Chr2 5 30 24 17 2
gene5 Chr2 40 80 4 28 16
"""
c2 = """feature Chromosome Start End
range1 Chr1 1 45
range2 Chr1 55 160
range3 Chr2 10 90
range4 Chr2 100 200 """
gr, gr2 = pr.from_string(c), pr.from_string(c2)
j = gr2.join(gr).drop(like="_b")
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# | feature | Chromosome | Start | End | count1 | count2 | count3 |
# | (object) | (category) | (int32) | (int32) | (int64) | (int64) | (int64) |
# |------------+--------------+-----------+-----------+-----------+-----------+-----------|
# | range1 | Chr1 | 1 | 45 | 100 | 30 | 22 |
# | range1 | Chr1 | 1 | 45 | 20 | 10 | 6 |
# | range2 | Chr1 | 55 | 160 | 40 | 11 | 7 |
# | range2 | Chr1 | 55 | 160 | 23 | 15 | 9 |
# | range3 | Chr2 | 10 | 90 | 24 | 17 | 2 |
# | range3 | Chr2 | 10 | 90 | 4 | 28 | 16 |
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# Unstranded PyRanges object has 6 rows and 7 columns from 2 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
df = j.df
fs = {"Chromosome": "first", "Start":
"first", "End": "first", "count1": "sum", "count2": "sum", "count3": "sum"}
result = df.groupby("feature".split()).agg(fs)
# Chromosome Start End count1 count2 count3
# feature
# range1 Chr1 1 45 120 40 28
# range2 Chr1 55 160 63 26 16
# range3 Chr2 10 90 28 45 18