I have a pandas DataFrame:
number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472
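I am trying to group together rows that have different numbers, but where the start is within +/- 10 AND the end is also within +/- 10, on the same chromosomes.
In this example, I want to find these two rows: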
24 s5 3 17286051 3 17311472
26 s4 3 17286052 3 17311472
where chrom1 [3] and chrom2 [3] are the same, and the start and end values are within +/- 10 of each other, and group them under the same number:
24 s5 3 17286051 3 17311472
24 s4 3 17286052 3 17311472 # Change the number to the first seen in this series
This is what I am trying:
import pandas as pd
from collections import defaultdict

def parse_vars(inFile):
    df = pd.read_csv(inFile, delimiter="\t")
    df = df[['number', 'chrom1', 'start', 'chrom2', 'end']]
    vars = {}
    seen_l = defaultdict(lambda: defaultdict(dict))  # To track the `starts`
    seen_r = defaultdict(lambda: defaultdict(dict))  # To track the `ends`
    for index in df.index:
        event = df.loc[index, 'number']
        c1 = df.loc[index, 'chrom1']
        b1 = int(df.loc[index, 'start'])
        c2 = df.loc[index, 'chrom2']
        b2 = int(df.loc[index, 'end'])
        print([event, c1, b1, c2, b2])
        vars[event] = [c1, b1, c2, b2]
        # Iterate over windows +/- 10
        for i, j in zip(range(b1-10, b1+10), range(b2-10, b2+10)):
            # if:
            #   i in seen_l[c1] AND
            #   j in seen_r[c2] AND
            #   the 'number' for these two instances is the same:
            if i in seen_l[c1] and j in seen_r[c2] and seen_l[c1][i] == seen_r[c2][j]:
                print(seen_l[c1][i], seen_r[c2][j])
                if seen_l[c1][i] != event:
                    print("Seen: %s %s in event %s %s" % (event, [c1, b1, c2, b2], seen_l[c1][i], vars[seen_l[c1][i]]))
        # Record this row only after checking it against earlier rows
        seen_l[c1][b1] = event
        seen_r[c2][b2] = event
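The problem I am having is that seen_l[3][17286052] exists in both numbers 25 and 26, and because their respective seen_r events (seen_r[3][17294628] = 25, seen_r[3][17311472] = 26) are not equal, I am unable to join these rows together.
Is there a way to use a list of start values as a nested key for the dictionary?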
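One way to keep both events at the same start position is to store a list of event numbers per position instead of a single number, so that a later row does not overwrite an earlier one. The following is only a minimal sketch under that assumption; the rows are copied from the table above, and shared_events is a hypothetical helper name, not part of the original attempt.

```python
from collections import defaultdict

# chromosome -> position -> list of event numbers (instead of a single number)
seen_l = defaultdict(lambda: defaultdict(list))  # starts
seen_r = defaultdict(lambda: defaultdict(list))  # ends

# (number, chrom1, start, chrom2, end), taken from the example table
rows = [
    (24, 3, 17286051, 3, 17311472),
    (25, 3, 17286052, 3, 17294628),
    (26, 3, 17286052, 3, 17311472),
]

for event, c1, b1, c2, b2 in rows:
    seen_l[c1][b1].append(event)
    seen_r[c2][b2].append(event)

def shared_events(c1, b1, c2, b2):
    """Hypothetical helper: events recorded at both this start and this end."""
    return sorted(set(seen_l[c1][b1]) & set(seen_r[c2][b2]))

print(seen_l[3][17286052])                         # both 25 and 26 are kept
print(shared_events(3, 17286052, 3, 17311472))     # events matching start and end
```

With lists as values, the equality test between seen_l and seen_r becomes a set intersection, so events 25 and 26 can share a start position without hiding each other.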
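Answer 0 (score: 0)
Interval overlaps are easy in pyranges. Most of the code below splits the starts and ends into two separate dfs. These are then joined based on interval overlap with a slack of +/- 10: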
from io import StringIO
import pandas as pd
import pyranges as pr
c = """number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472"""
df = pd.read_table(StringIO(c), sep=r"\s+")
# .copy() avoids inserting into a view of df
df1 = df[["chrom1", "start", "number", "sample"]].copy()
df1.insert(2, "end", df.start + 1)
df2 = df[["chrom2", "end", "number", "sample"]].copy()
df2.insert(2, "start", df.end - 1)
names = ["Chromosome", "Start", "End", "number", "sample"]
df1.columns = names
df2.columns = names
gr1, gr2 = pr.PyRanges(df1), pr.PyRanges(df2)
j = gr1.join(gr2, slack=10)
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# | Chromosome | Start | End | number | sample | Start_b | End_b | number_b | sample_b |
# | (category) | (int32) | (int32) | (int64) | (object) | (int32) | (int32) | (int64) | (object) |
# |--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------|
# | 3 | 3125694 | 3125695 | 20 | s4 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125694 | 3125695 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125695 | 3125696 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125700 | 3125701 | 20 | s5 | 3125700 | 3125699 | 19 | s2 |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 25 | s2 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s5 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s1 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s4 |
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# Unstranded PyRanges object has 13 rows and 9 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
# to get the data as a pandas df:
jdf = j.df
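For the relabelling step described in the question (giving matching rows the first number seen), a plain pandas pass may also be enough. The sketch below is not the pyranges approach above; it is an assumption-laden alternative in which regroup and its window parameter are hypothetical names, rows are compared only against the first row of each group (so matches are not chained transitively), and the input is assumed to be in the order shown in the question.

```python
from io import StringIO
import pandas as pd

data = """number sample chrom1 start chrom2 end
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
"""

def regroup(df, window=10):
    """Hypothetical helper: relabel a row with the number of the first earlier
    row on the same chrom1/chrom2 whose start AND end are both within `window`."""
    out = df.copy()
    seen = {}          # (chrom1, chrom2) -> list of (start, end, number)
    new_numbers = []
    for row in out.itertuples(index=False):
        key = (row.chrom1, row.chrom2)
        label = row.number
        for start, end, number in seen.setdefault(key, []):
            if abs(row.start - start) <= window and abs(row.end - end) <= window:
                label = number   # reuse the first number seen for this group
                break
        else:
            # no earlier row matched, so this row starts a new group
            seen[key].append((row.start, row.end, row.number))
        new_numbers.append(label)
    out["number"] = new_numbers
    return out

df = pd.read_csv(StringIO(data), sep=r"\s+")
print(regroup(df))
```

On this subset, the two rows numbered 26 are relabelled 24, while 25 and 27 keep their numbers because their ends differ by more than 10. The loop is quadratic in the number of groups per chromosome pair, which should be acceptable for small tables but may need an interval-based join for large ones.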