I need to convert a genotype dosage file into an allele dosage file.
The input looks like this:
#snp a1 a2 i1 j1 i2 j2 i3 j3
chr6_24000211_D D I3 0 0 0 0 0 0
rs78244999 A G 1 0 1 0 1 0
rs1511479 T C 0 1 1 0 0 1
rs34425199 A C 0 0 0 0 0 0
rs181892770 A G 1 0 1 0 1 0
rs501871 A G 0 1 0.997 0.003 0 1
chr6_24000836_D D I4 0 0 0 0 0 0
chr6_24000891_I I2 D 0 0 0 0 0 1
rs16888446 A C 0 0 0 0 0 0
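Columns 1-3 are identifiers. Nothing should be done to them; they should simply be copied as-is to the output file. The remaining columns need to be treated as pairs of columns i and j, and for each pair the following operation is needed: 2 * i + j
Pseudocode: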
write first three columns of input file to output
for all i and j in the file, write 2*i + j to output
The desired output looks like this:
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
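I will be doing this on several files with different total numbers of columns, so I would like the loop to run for (total number of columns - 3) / 2 iterations, i.e. until it reaches the last column of the file.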
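To make the iteration count concrete, here is a minimal Python sketch for a single data row, assuming whitespace-separated fields. The function name `to_allele_dosage` and the `'g'` number formatting (six significant digits, so 2.0 prints as 2) are illustrative choices, not part of the question. It derives the number of pairs as (total columns - 3) / 2 and applies 2 * i + j to each pair.

```python
def to_allele_dosage(fields):
    """Convert one data row: keep 3 id fields, then 2*i + j per pair.

    `fields` is a list of strings. The name and the 'g' formatting
    are assumptions made for this sketch.
    """
    # Number of (i, j) pairs is (total columns - 3) / 2.
    n_pairs, leftover = divmod(len(fields) - 3, 2)
    if len(fields) < 3 or leftover:
        raise ValueError("expected 3 id columns plus an even number of values")
    out = fields[:3]
    for k in range(n_pairs):
        i = float(fields[3 + 2 * k])
        j = float(fields[4 + 2 * k])
        out.append(format(2 * i + j, "g"))
    return out


row = "rs501871 A G 0 1 0.997 0.003 0 1".split()
print(" ".join(to_allele_dosage(row)))
```

Rejecting rows with an unpaired trailing value is a design choice for the sketch; silently dropping the extra column would hide malformed input.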
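The input file is roughly 9 million lines by ~10,000 columns, so reading it into a program such as R is very slow. I am not sure which tool would be most efficient for this (awk? perl? python?), and as a novice coder I am not sure where to start with the syntax of a solution.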
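Since the file is too large to load comfortably at once, one option is to stream it line by line so that memory use is bounded by a single row. The sketch below assumes whitespace-separated input with exactly one header line; `convert_stream` and its parameters are hypothetical names for illustration, and no claim is made about its speed relative to awk or perl.

```python
import io


def convert_stream(src, dst):
    """Read rows from `src` and write allele dosages to `dst`, one line at a time."""
    for line_no, line in enumerate(src):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ids, values = fields[:3], fields[3:]
        if line_no == 0:
            # Header: replace each i/j column pair with a running number.
            converted = [str(n) for n in range(1, len(values) // 2 + 1)]
        else:
            # Data: pair even-offset (i) values with odd-offset (j) values.
            converted = [
                format(2 * float(i) + float(j), "g")
                for i, j in zip(values[0::2], values[1::2])
            ]
        dst.write(" ".join(ids + converted) + "\n")


# Small in-memory demonstration using two of the pairs from the sample input.
sample = io.StringIO("#snp a1 a2 i1 j1 i2 j2\nrs1511479 T C 0 1 1 0\n")
out = io.StringIO()
convert_stream(sample, out)
print(out.getvalue(), end="")
```

With real data, `src` and `dst` could be opened file objects or `sys.stdin` and `sys.stdout`; iterating over a file object reads one line at a time rather than the whole file.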
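Answer 0 (score: 0)
Here is an awk implementation of the algorithm you posted, slightly enhanced to produce the first line you show in your expected output: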
$ cat tst.awk
{
    printf "%s %s %s", $1, $2, $3
    c=0
    for (i=4; i<NF; i+=2) {
        printf " %s", (NR>1 ? 2*$i + $(i+1) : ++c)
    }
    print ""
}
$ awk -f tst.awk file
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
Answer 1 (score: 0)
Python version:
#!/usr/bin/env python3
import csv
import sys
from itertools import zip_longest

def chunk(sequence, chunk_size=2):
    """
    list(chunk([1, 2, 3, 4], 2)) => [(1, 2), (3, 4)]
    """
    # Take advantage of the same iterator being consumed
    # multiple times to do the grouping
    return zip_longest(*[iter(sequence)] * chunk_size)

def processor(csv_reader):
    for row_number, row in enumerate(csv_reader):
        pairs = chunk(row[3:])
        if row_number == 0:
            # header row: number the pairs 1, 2, 3, ...
            processed_pairs = [str(n) for n, _ in enumerate(pairs, 1)]
        else:
            # collect the pairs and process them; 'g' prints 2.0 as 2
            processed_pairs = [format(2 * float(i) + float(j), 'g') for i, j in pairs]
        # yield back the first 3 elements and the processed pairs
        yield row[:3] + processed_pairs

if __name__ == '__main__':
    with open(sys.argv[1], newline='') as csvfile:
        source = processor(csv.reader(csvfile, delimiter=' '))
        for line in source:
            print(' '.join(line))
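The grouping trick inside `chunk` can be easier to follow in isolation. The sketch below is only an illustration with made-up variable names; it uses `itertools.zip_longest`, which is the Python 3 name for Python 2's `izip_longest`.

```python
from itertools import zip_longest

values = ["0", "1", "0.997", "0.003", "0", "1"]
it = iter(values)
# Both arguments are the *same* iterator object, so each output tuple
# takes two consecutive items from it.
pairs = list(zip_longest(it, it))
print(pairs)

# With an odd number of items, zip_longest pads the last tuple with None.
odd = iter(["a", "b", "c"])
padded = list(zip_longest(odd, odd))
print(padded)
```

Because of that `None` padding, a row with an unpaired trailing value would fail when the padding reaches `float()`, instead of being silently truncated.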
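Answer 2 (score: -1)
This does what you ask. It expects the input file as a parameter on the command line and sends its output to STDOUT, which you can redirect to a file if you wish.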
use strict;
use warnings;

while (<>) {
    my @fields = split;
    my @probs = splice @fields, 3;
    if (/^#/) {
        push @fields, 1 .. @probs / 2;
    }
    else {
        while (@probs >= 2) {
            my ($i, $j) = splice @probs, 0, 2;
            push @fields, $i + $i + $j;
        }
    }
    print "@fields\n";
}
Output
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0