我有一个这样的文件:
chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3
我想做的是: 设置一个窗口大小(假设为2),沿着文件移动它,并计算窗口内第4列的平均值和G + C的%。
我现在有这样的想法:
import numpy
def movingaverage(interval, window_size):
window = numpy.ones(int(window_size))/float(window_size)
return numpy.convolve(interval, window, 'same')
它为第4列做了工作,但我不知道如何应用它来计算窗框内G + C的内容
干杯, Irek
答案 0 :(得分:1)
from __future__ import division
from itertools import tee, izip
from collections import Counter
text = '''\
chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3'''
def window(iterable, size):
iters = tee(iterable, size)
for i in xrange(1, size):
for each in iters[i:]:
next(each, None)
return izip(*iters)
def get_avg(lists, column):
return sum(zip(*lists)[column]) / len(lists)
def get_GC_percentage(lists, column):
counts = Counter(zip(*lists)[column])
return (counts['C'] + counts['G']) / len(lists)
line_tuples = (line.split() for line in text.split('\n'))
line_tuples_casted = ((a,int(b),c,int(d)) for a,b,c,d in line_tuples)
line_tuples_chunks = window(line_tuples_casted, 2)
for (i,chunk) in enumerate(line_tuples_chunks):
print 'i: {:2} | avg: {} | GC_content: {:5.0%}'.format(i, get_avg(chunk, 3), get_GC_percentage(chunk, 2))
输出:
i: 0 | avg: 3.0 | GC_content: 50%
i: 1 | avg: 3.0 | GC_content: 50%
i: 2 | avg: 2.5 | GC_content: 50%
i: 3 | avg: 1.5 | GC_content: 100%
i: 4 | avg: 1.5 | GC_content: 50%
i: 5 | avg: 2.5 | GC_content: 50%
i: 6 | avg: 3.0 | GC_content: 100%
i: 7 | avg: 4.0 | GC_content: 50%
i: 8 | avg: 6.5 | GC_content: 0%
i: 9 | avg: 4.5 | GC_content: 0%
i: 10 | avg: 0.5 | GC_content: 50%
i: 11 | avg: 0.0 | GC_content: 100%
i: 12 | avg: 0.0 | GC_content: 100%
i: 13 | avg: 1.0 | GC_content: 100%
i: 14 | avg: 2.5 | GC_content: 50%
i: 15 | avg: 3.0 | GC_content: 0%
但请注意,这不是最佳解决方案。我们可以通过不计算整个窗口的每次迭代的平均值来做得更好,但是使用离开窗口的值来更新它。
答案 1 :(得分:1)
如果我明白你想做什么,这是一种方式(不是最佳但有效):
第一个数据文件 file.data
chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3
现在脚本:
import numpy as np
d = {'A':0, 'G': 1, 'T':2, 'C':3, 'U':4}
data = np.loadtxt('file.data', delimiter=' ', converters = {0: lambda x: int(x[-1]), 2: lambda x: d[x]})
win_size = 2
for i in range(data.shape[0] / win_size):
m = data[i:i+win_size,:]
avg = np.mean(m[:,3])
cg_per = float(np.where( ( m[:,2] == d['G'] )| ( m[:,2] == d['C']) )[0].shape[0]) * 100 / win_size
print "Window {0} avg:{1} C+G={2}%".format(i, avg, cg_per)
它将生成:
Window 0 avg:3.0 C+G=50.0%
Window 1 avg:3.0 C+G=50.0%
Window 2 avg:2.5 C+G=50.0%
Window 3 avg:1.5 C+G=100.0%
Window 4 avg:1.5 C+G=50.0%
Window 5 avg:2.5 C+G=50.0%
Window 6 avg:3.0 C+G=100.0%
Window 7 avg:4.0 C+G=50.0%
答案 2 :(得分:1)
您似乎已经知道如何将数据转换为numpy数组,因此假设您将它们放在名为bases
的数组中。如果你现在这样做:
base_mask = (bases == 'G') | (basses == 'C')
如果数组有G或C,你有一个带True
的布尔掩码,其他地方有False
。由于布尔值为整数的布尔值将True
转换为1
s且将False
转换为0
s,因此只需计算base_mask
数组上的平均值即可与interval
相同的方式。
答案 3 :(得分:1)
Deques非常适合滑动窗户。请尝试以下方法:
from collections import deque
from itertools import islice
FN = "temp.txt"
WSIZE = 2
def gen_stream(f):
for line in f:
line = line.split()
yield [1 if line[2] in 'GC' else 0, int(line[3])]
def overlapping():
with open(FN) as f:
stream = gen_stream(f)
window = deque([stream.next() for _ in xrange(WSIZE-1)], WSIZE)
for row in stream:
window.append(row)
print [sum(row[i] for row in window)/float(WSIZE) for i in xrange(2)]
def non_overlapping():
with open(FN) as f:
stream = gen_stream(f)
while True:
chunk = list(islice(stream, WSIZE))
if not chunk:
break
print [sum(row[i] for row in chunk)/float(len(chunk)) for i in xrange(2)]
这是可扩展的,即适用于庞大的文件。