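Here is my problem: I start from several matrices and extract data from them to build a new, combined matrix. The first step is to read the infiles with the csv module and extract the "position" values (stored in row[1]), which will be used as column headers in the final matrix. Each infile contains a subset of the total positions, and a given position may occur in one or more infiles. So I first build an ordered list (integers, smallest to largest) from the union of all position values, ignoring duplicates. This is how I do it: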
import csv
import glob

pos = []  # unique position values seen so far
for infile in glob.glob('passed_*.vcf'):
    infilen = open(infile)
    inf = csv.reader(infilen, delimiter='\t')
    for row in inf:
        # keep each position only once
        if row[1] in pos:
            continue
        else:
            pos.append(row[1])
    infilen.close()
pos.sort(key=int)
head = str('\t'.join(pos))
of = open('trial.txt', 'a')
print >>of, head
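The test `row[1] in pos` rescans the whole list for every row, which gets slow as the number of positions grows. Below is a minimal sketch of the same de-duplication using a set, written for Python 3 (the code in this thread is Python 2). The function name `collect_positions` and the in-memory sample text are assumptions made for illustration, and the sketch assumes every line is a data row, as in the samples shown in the question.

```python
import csv
import io

def collect_positions(handles):
    """Return the union of the row[1] values of several TSV tables, sorted as integers."""
    seen = set()
    for handle in handles:
        for row in csv.reader(handle, delimiter='\t'):
            if len(row) > 1:  # skip blank or short lines
                seen.add(row[1])
    return sorted(seen, key=int)

# Hypothetical stand-ins for two of the passed_*.vcf files
sample1 = io.StringIO("1\t2025\tblah\tA\n2\t2027\tblah\tC\n3\t2028\tblah\tT\n")
samplen = io.StringIO("1\t2025\tblah\tG\n2\t2026\tblah\tA\n3\t3089\tblah\tT\n")

positions = collect_positions([sample1, samplen])
print('\t'.join(positions))
```

With real files, the handles would come from `open(path, newline='')` for each path returned by `glob.glob`.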
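Once that is done, I go back to the original infiles and read another value (this time in row[3]) that I want to add under the corresponding header (i.e. "position") created above. Since each infile holds only a subset of the total positions, I have to fill the gaps whenever a position of the final matrix (stored in the list pos) is not present in row[1] of a given infile. Here is the code I'm trying: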
for infile in glob.glob('passed_*.vcf'):
    infilen = open(infile)
    inf = csv.reader(infilen, delimiter='\t')
    seq = []
    for row in inf:
        if row[1] in pos:
            seq.append(row[3])
        else:
            seq.append('N')
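Needless to say, I'm stuck. I was thinking of using a while loop, but since I have no real experience with them, I'd appreciate any kind of advice.
Input (sample 1):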
1 2025 blah A . blah PASS AC=0 GT:DP 0/0:61
2 2027 blah C . blah blah AC=0 GT:DP 0/0:61
3 2028 blah T . blah PASS AC=0 GT:DP 0/0:61
Input (sample n):
1 2025 blah G . blah PASS AC=0 GT:DP 0/0:61
2 2026 blah A . blah blah AC=0 GT:DP 0/0:61
3 3089 blah T . blah PASS AC=0 GT:DP 0/0:61
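Output (a single matrix, with the inputs' row[1] as variables and row[3] as values. Each line is a different sample, i.e. a different input file):
         2025  2026  2027  2028  ...  3089
sample1  A     NaN   C     T          NaN
samplen  G     A     NaN   NaN        T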
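In the second loop of the question, `row[1] in pos` is always true, because pos was built from every file, so the placeholder is never appended. The gap filling can instead be sketched by walking the final position list and looking each position up in the current sample. This is a Python 3 sketch under assumptions: `row_for_sample` and its `missing` parameter are hypothetical names, and the rows are the sample data shown above, already split into fields.

```python
def row_for_sample(rows, positions, missing='NaN'):
    """Build one matrix line: the row[3] value for each position, or a placeholder."""
    # Map position (row[1]) to value (row[3]) for this one sample
    lookup = {row[1]: row[3] for row in rows}
    # Walk the full header so positions absent from this sample get the placeholder
    return [lookup.get(p, missing) for p in positions]

positions = ['2025', '2026', '2027', '2028', '3089']
sample1 = [
    ['1', '2025', 'blah', 'A'],
    ['2', '2027', 'blah', 'C'],
    ['3', '2028', 'blah', 'T'],
]
samplen = [
    ['1', '2025', 'blah', 'G'],
    ['2', '2026', 'blah', 'A'],
    ['3', '3089', 'blah', 'T'],
]

line1 = row_for_sample(sample1, positions)
linen = row_for_sample(samplen, positions)
print('\t'.join(['sample1'] + line1))
print('\t'.join(['samplen'] + linen))
```

Looping over the header rather than over the file's rows is what guarantees one cell per column, so no while loop is needed.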
Answer 0 (score: 0)
>>> from collections import defaultdict
>>> import glob
>>> pos = defaultdict(dict)
>>> for index, infile in enumerate(glob.glob('D:\\DATA\\FP12210\\My Documents\\Temp\\Python\\sample*.vcf'), 1):
...     for line in open(infile):
...         # Convert the position to an integer right away
...         val, letter = int(line.split()[1]), line.split()[3]
...         pos[val][index] = letter
...
>>> def print_pos(pos):
...     """ Formats pos """
...     # Print header by sorting keys of pos
...     values = sorted(pos.keys())
...     print '          ',
...     for val in range(values[0], values[-1] + 1):
...         print '{0:5}'.format(val),
...     print
...     # pos has keys according to row[1], create pos2 with keys = sample #
...     pos2 = defaultdict(dict)
...     for val, d in pos.iteritems():
...         for index, letter in d.iteritems():
...             pos2[index][val] = letter
...     # Now easier to print lines
...     for index in sorted(pos2.keys()):
...         print ' sample{0:2} '.format(index),
...         for val in range(values[0], values[-1] + 1):
...             if val in pos2[index]:
...                 print '  {0}  '.format(pos2[index][val]),
...             else:
...                 print ' NaN ',
...         print
...
>>> print_pos(pos)
            2025  2026  2027  2028  2029  2030  2031  2032
 sample 1    A    NaN    C     T    NaN   NaN   NaN   NaN
 sample 2    G     A    NaN   NaN   NaN   NaN   NaN    T
>>>
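I use pos to collect the values, and I also use pos2, which contains the same data arranged differently for printing, because:
pos is value-oriented, which is useful for getting the range of values
pos2 is sample-oriented, which is useful for printing by sample number
To keep the range from getting too large, I used these values: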
-sample1.vcf:
1 2025 blah A . blah PASS AC=0 GT:DP 0/0:61
2 2027 blah C . blah blah AC=0 GT:DP 0/0:61
3 2028 blah T . blah PASS AC=0 GT:DP 0/0:61
-sample2.vcf:
1 2025 blah G . blah PASS AC=0 GT:DP 0/0:61
2 2026 blah A . blah blah AC=0 GT:DP 0/0:61
3 2032 blah T . blah PASS AC=0 GT:DP 0/0:61
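Note that print_pos walks every integer between the smallest and largest position, so with the question's original data (2025 to 3089) it would print over a thousand columns, nearly all of them NaN. Below is a Python 3 sketch of a variant that keeps the same pos[val][sample] layout but uses only the positions actually observed as columns, and reads the rows straight from pos without building pos2. The name `build_matrix`, the `missing` parameter, and the in-memory sample lines are assumptions for illustration, not part of the code above.

```python
from collections import defaultdict

def build_matrix(samples, missing='NaN'):
    """samples maps a sample name to its list of split rows.

    Returns (header, lines) with one column per observed position.
    """
    pos = defaultdict(dict)
    for name, rows in samples.items():
        for row in rows:
            # Same layout as above: position -> {sample: letter}
            pos[int(row[1])][name] = row[3]
    header = sorted(pos)  # only positions seen in at least one file
    lines = [
        [name] + [pos[val].get(name, missing) for val in header]
        for name in samples
    ]
    return header, lines

samples = {
    'sample1': [line.split() for line in (
        '1 2025 blah A . blah PASS AC=0 GT:DP 0/0:61',
        '2 2027 blah C . blah blah AC=0 GT:DP 0/0:61',
        '3 2028 blah T . blah PASS AC=0 GT:DP 0/0:61',
    )],
    'sample2': [line.split() for line in (
        '1 2025 blah G . blah PASS AC=0 GT:DP 0/0:61',
        '2 2026 blah A . blah blah AC=0 GT:DP 0/0:61',
        '3 2032 blah T . blah PASS AC=0 GT:DP 0/0:61',
    )],
}

header, lines = build_matrix(samples)
print('\t'.join([''] + [str(val) for val in header]))
for line in lines:
    print('\t'.join(line))
```

Joining the cells with tabs gives a file that the csv module can read back with `delimiter='\t'`, which matches how the question reads its inputs.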