Loop to multiply column i by 2, add column j, and repeat for all column pairs in a file

Time: 2014-11-30 23:17:28

Tags: python perl awk text-formatting

I need to convert a genotype dosage file into an allele dosage file.

The input looks like this:

    #snp a1 a2 i1 j1 i2 j2 i3 j3
    chr6_24000211_D D I3 0 0 0 0 0 0
    rs78244999 A G 1 0 1 0 1 0
    rs1511479 T C 0 1 1 0 0 1
    rs34425199 A C 0 0 0 0 0 0
    rs181892770 A G 1 0 1 0 1 0
    rs501871 A G 0 1 0.997 0.003 0 1
    chr6_24000836_D D I4 0 0 0 0 0 0
    chr6_24000891_I I2 D 0 0 0 0 0 1
    rs16888446 A C 0 0 0 0 0 0

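Columns 1-3 are identifiers. No operation should be performed on them; they should simply be copied to the output file as is. The remaining columns need to be treated as pairs of column i and column j, and the following operation needs to be performed on each pair: 2 * i + j

Pseudocode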

write first three columns of input file to output

for all i and j in the file, write 2*i + j to output
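
The pseudocode above could be sketched in Python for a single data line. This is only a minimal illustration, not a full solution: the function name `to_allele_dosage` is made up for this example, and it assumes whitespace-separated fields with an even number of dosage columns after the three identifiers.

```python
def to_allele_dosage(line):
    """Convert one data line; hypothetical helper, not from the original post."""
    fields = line.split()
    ids, values = fields[:3], fields[3:]
    # Pair up columns (i1, j1), (i2, j2), ... and compute 2*i + j for each pair.
    # Assumes an even number of dosage columns.
    dosages = [2 * float(i) + float(j) for i, j in zip(values[0::2], values[1::2])]
    # "%g" drops the trailing ".0", so 2.0 is written as 2
    return " ".join(ids + ["%g" % d for d in dosages])

print(to_allele_dosage("rs501871 A G 0 1 0.997 0.003 0 1"))
# prints: rs501871 A G 1 1.997 1
```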

The desired output looks like this:

#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0

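I will be doing this on several files with different total numbers of columns, so I would like the loop to run for (total number of columns - 3) / 2 iterations, i.e. until it reaches the last column of the file.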
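
As a quick sanity check of that iteration count, the number of pairs can be derived from the header line. A small sketch, assuming three identifier columns; `pair_count` is a hypothetical helper name, not something from an existing library:

```python
def pair_count(n_columns, n_id_columns=3):
    """Number of (i, j) pairs in a row; hypothetical helper for illustration."""
    n_values = n_columns - n_id_columns
    # The dosage columns must come in pairs, so their count has to be even
    if n_values < 0 or n_values % 2 != 0:
        raise ValueError("expected an even number of dosage columns, got %d" % n_values)
    return n_values // 2

header = "#snp a1 a2 i1 j1 i2 j2 i3 j3".split()
print(pair_count(len(header)))  # prints: 3
```

Checking the count once per file may help catch a truncated or malformed header before a long run starts.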

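The input files have about 9 million rows and ~10,000 columns, so reading them into a program such as R is very slow. I'm not sure which tool would be most efficient for this (awk? perl? python?), and as a novice coder I'm not sure where to start with the syntax of a solution.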
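
Because each output row depends only on its own input row, one option is to process the file one line at a time, so memory use stays at roughly one row instead of the whole file. A minimal sketch, assuming whitespace-separated input and a header line that starts with `#`; the function name `convert_stream` is hypothetical:

```python
import io

def convert_stream(src, dst):
    """Stream lines from src to dst, one row at a time (illustrative sketch)."""
    for line in src:               # file-like objects yield one line at a time
        fields = line.split()
        if not fields:
            continue               # skip blank lines
        ids, values = fields[:3], fields[3:]
        if fields[0].startswith("#"):
            # Header: replace the i/j column names with pair numbers 1..n
            out = [str(n) for n in range(1, len(values) // 2 + 1)]
        else:
            # Data row: 2*i + j for each consecutive pair of columns
            out = ["%g" % (2 * float(i) + float(j))
                   for i, j in zip(values[0::2], values[1::2])]
        dst.write(" ".join(ids + out) + "\n")

# Small in-memory demonstration; real use would pass open file handles instead.
src = io.StringIO("#snp a1 a2 i1 j1 i2 j2\nrs1511479 T C 0 1 1 0\n")
dst = io.StringIO()
convert_stream(src, dst)
print(dst.getvalue(), end="")
```

With real files, `src` and `dst` would be handles returned by `open(...)`, or `sys.stdin` and `sys.stdout`. Whether this is fast enough for 9 million rows by 10,000 columns has not been measured here.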

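3 Answers:

Answer 0: (score: 0)

Here is an awk implementation of the algorithm you posted, slightly enhanced to produce the first line you show in your expected output: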

$ cat tst.awk
{
    printf "%s %s %s", $1, $2, $3
    c=0
    for (i=4; i<NF; i+=2) {
        printf " %s", (NR>1 ? 2*$i + $(i+1) : ++c)
    }
    print ""
}

$ awk -f tst.awk file
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0

Answer 1: (score: 0)

Python version

#!/usr/bin/env python3

import csv
import sys
from itertools import zip_longest

def chunk(sequence, chunk_size=2):
    """
    list(chunk([1,2,3,4], 2)) => [(1,2),(3,4)]
    """
    # Take advantage of the same iterator being consumed
    # multiple times/sources to do grouping
    return zip_longest(*[iter(sequence)] * chunk_size)

def processor(csv_reader):
    for row in csv_reader:
        pairs = chunk(row[3:])
        if row and row[0].startswith('#'):
            # header line: number the pairs 1, 2, 3, ... instead of doing arithmetic
            processed_pairs = [str(n) for n, _ in enumerate(pairs, 1)]
        else:
            # collect the pairs and process them; %g writes 2.0 as "2"
            processed_pairs = ['%g' % (2*float(i) + float(j)) for i, j in pairs]
        # yield back the first 3 elements and the processed pairs
        yield row[0:3] + processed_pairs

if __name__ == '__main__':
    # newline='' is what the csv module expects for files opened in text mode
    with open(sys.argv[1], newline='') as csvfile:
        source = processor(csv.reader(csvfile, delimiter=' '))
        for fields in source:
            print(' '.join(fields))

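Answer 2: (score: -1)

This does what you ask. It expects the input file as an argument on the command line and sends its output to STDOUT, which you can redirect to a file if you wish.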

use strict;
use warnings;

while (<>) {

  my @fields = split;
  my @probs = splice @fields, 3;

  if (/^#/) {
    push @fields, 1 .. @probs / 2;
  }
  else {
    while (@probs >= 2) {
      my ($i, $j) = splice @probs, 0, 2;
      push @fields, $i + $i + $j;
    }
  }

  print "@fields\n";
}

Output

#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0