Question

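Can anyone suggest a simple way to achieve this? I have several files ending with the extension .vcf. I will use two files as an example. The files below contain the data we are interested in.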

File 1:

38  107 C   3   T   6   C/T
38  241 C   4   T   5   C/T
38  247 T   4   C   5   T/C
38  259 T   3   C   6   T/C
38  275 G   3   A   5   G/A
38  304 C   4   T   5   C/T
38  323 T   3   A   5   T/A

File 2:

38  107 C   8   T   8   C/T
38  222 -   6   A   7   -/A
38  241 C   7   T   10  C/T
38  247 T   7   C   10  T/C
38  259 T   7   C   10  T/C
38  275 G   6   A   11  G/A
38  304 C   5   T   12  C/T
38  323 T   4   A   12  T/A
38  343 G   13  A   5   G/A

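Index file:

The index file is built from the unique positions found in File 1 and File 2. I already have this index file prepared. Now I need to read all the files, parse the data according to the positions in the index, and write the values out as columns. From the files above, we are interested in the 4th (Ref) and 6th (Alt) columns. Another challenge is naming the headers accordingly. The output should look like this: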

Position    File1_Ref   File1_Alt   File2_Ref   File2_Alt
107 3   6   8   8
222         6   7
241 4   5   7   10
247 4   5   7   10
259 3   6   7   10
275 3   5   6   11
304 4   5   5   12
323 3   5   4   12
343         13  5

Answer 1

You can do this with the join command:

# add file1
$ join -e' ' -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n index) <(sort -n -k2 file1) > file1.merged

# add file2
$ join -e' ' -1 1 -2 2 -a 1 -o 0,1.2,1.3,2.4,2.6 file1.merged <(sort -n -k2 file2) > file2.merged

# create the header
$ echo "Position File1_Ref File1_Alt File2_Ref File2_Alt" > report
$ cat file2.merged >> report

Output:

$ cat report

Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222     6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
343     13 5
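
The commands above assume that the index file already exists. If it ever has to be regenerated, something like the following might work. This is only a sketch: build_index is a hypothetical helper name, and it assumes the position is always the second whitespace-separated column, as in the sample files.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original answer): collect the unique
# positions (column 2) from all input files into one numerically sorted file.
# Usage: build_index OUTPUT_FILE INPUT_FILE...
build_index() {
    local out=$1
    shift
    # print column 2 of every input file, then sort numerically and drop duplicates
    awk '{ print $2 }' "$@" | sort -n -u > "$out"
}

# Example call, assuming file1 and file2 are in the current directory:
#   build_index index file1 file2
```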

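Update

Here is a script that can be used to combine multiple files.

The following assumptions have been made:

the index file is sorted
the vcf files are sorted on the second column
there are no spaces (or any other special characters) in the file names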
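
If you are not sure whether the first two assumptions hold for your data, a quick check along these lines may help. It is a sketch only: check_sorted is a hypothetical helper name, and it assumes whitespace-separated columns as in the sample files.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of the original script).
# Returns 0 if the index file is numerically sorted and every data file is
# numerically sorted on column 2; otherwise names the offending file.
# Usage: check_sorted INDEX_FILE DATA_FILE...
check_sorted() {
    local index=$1 f
    shift
    # sort -c only checks the order, it does not write any output
    sort -c -n "$index" 2>/dev/null || { echo "not sorted: $index" >&2; return 1; }
    for f in "$@"; do
        sort -c -n -k2,2 "$f" 2>/dev/null || { echo "not sorted: $f" >&2; return 1; }
    done
    return 0
}

# Example call, assuming index, file1 and file2 are in the current directory:
#   check_sorted index file1 file2 && echo "inputs look sorted"
```

Note that join itself compares the join field as a string, so numeric order and the order join expects are guaranteed to agree only while all positions have the same number of digits, as they do in the sample data.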

Save the script below to a file, e.g. report.sh, and run it without any arguments from the directory containing the files.

#!/bin/bash

INDEX_FILE=index    # the name of the file containing the index data
REPORT_FILE=report  # the file to write the report to
TMP_FILE=$(mktemp)  # a temporary file

header="Position"   # the report header
num_processed=0     # the number of files processed so far 

# loop over all files beginning with "file". 
# this pattern can be changed to something else e.g. *.vcf
for file in file*
do
    echo "Processing $file"
    if [[ $num_processed -eq 0 ]]
    then
        # it's the first file so use the INDEX file in the join
        join -e' ' -t, -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n "$INDEX_FILE") <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
    else
        # work out the output fields
        for ((outputFields="0",j=2; j < $((2 + $num_processed * 2)); j++))
        do
            outputFields="$outputFields,1.$j"
        done
        outputFields="$outputFields,2.4,2.6"

        # join this file with the current report
        join -e' ' -t, -1 1 -2 2 -a 1 -o "$outputFields" "$REPORT_FILE" <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
    fi
    ((num_processed++))
    header="$header,File${num_processed}_Ref,File${num_processed}_Alt"
    mv "$TMP_FILE" "$REPORT_FILE"
done

# add the header to the report
echo "$header" | cat - "$REPORT_FILE"  > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"

# the report is a csv file. Uncomment the line below to make it space-separated.
# tr ',' ' ' < "$REPORT_FILE"  > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"
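
The least obvious part of the script is probably the loop that works out the -o field list for join. The sketch below isolates that logic in a stand-alone function so it can be tried on its own; output_fields is a hypothetical name used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper mirroring the field-list loop in the script above:
# build the join -o list for a report that already holds N processed files.
output_fields() {
    local n=$1 fields="0" j
    # report columns 2 .. (1 + 2*n) hold the Ref/Alt pairs gathered so far
    for ((j = 2; j < 2 + n * 2; j++)); do
        fields+=",1.$j"
    done
    # append Ref (column 4) and Alt (column 6) of the file being added
    echo "$fields,2.4,2.6"
}

output_fields 1   # prints 0,1.2,1.3,2.4,2.6
output_fields 2   # prints 0,1.2,1.3,1.4,1.5,2.4,2.6
```

Each processed file adds two columns to the report, so the list of fields taken from the report (the 1.x entries) grows by two on every pass, while the fields taken from the new file stay fixed at 2.4 and 2.6.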

Answer 2

This Perl solution handles one or more (e.g. 50) files.

#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp qw/ slurp /;
use Text::Table;

my $path = '.';
my @file = qw/ o33.txt o44.txt /;
my @position = slurp('index.txt') =~ /\d+/g;
my %data;

for my $filename (@file) {
    open my $fh, '<', "$path/$filename" or die "Can't open $filename: $!";
    while (<$fh>) {
        my ($pos, $ref, $alt) = (split)[1, 3, 5];
        $data{$pos}{$filename} = [$ref, $alt];
    }
    close $fh or die "Can't close $filename $!";
}

my @head;
for my $file (@file) {
    push @head, "${file}_Ref", "${file}_Alt";
}

my $tb = Text::Table->new( map {title => $_}, "Position", @head);

for my $pos (@position) {
    $tb->load( [
                $pos,
                map $data{$pos}{$_} ? @{ $data{$pos}{$_} } : ('', ''), @file
               ]
    );
}
print $tb;

Read and parse multiple text files
