Question

I have a tab-delimited text file that looks like this:

contig11 GO:100 other columns of data
contig11 GO:289 other columns of data
contig11 GO:113 other columns of data
contig22 GO:388 other columns of data
contig22 GO:101 other columns of data

And another like this:

contig11 3 N
contig11 1 Y
contig22 1 Y
contig22 2 N

I need to combine them so that each "multiple" entry in one file is duplicated and filled in with its data from the other file, so that I get:

contig11 3 N GO:100 other columns of data
contig11 3 N GO:289 other columns of data
contig11 3 N GO:113 other columns of data
contig11 1 Y GO:100 other columns of data
contig11 1 Y GO:289 other columns of data
contig11 1 Y GO:113 other columns of data
contig22 1 Y GO:388 other columns of data
contig22 1 Y GO:101 other columns of data
contig22 2 N GO:388 other columns of data
contig22 2 N GO:101 other columns of data

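I have very little scripting experience. I have done this with hashes/keys in the case where e.g. "contig11" occurs only once in one of the files, but I can't even begin to get my head around this version! I would really appreciate some help or hints on how to tackle this problem.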

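EDIT: So I tried ikegami's suggestion (see answers) with the script below. It produces the output I need, except for everything from the GO:100 column onwards ($rest in the script?). Any ideas what I'm doing wrong?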

#!/usr/bin/env/perl

use warnings;

open (GOTERMS, "$ARGV[0]") or die "Error opening the input file with GO terms";
open (SNPS, "$ARGV[1]") or die "Error opening the input file with SNPs";

my %goterm;

while (<GOTERMS>)
{
    my($id, $rest) = /^(\S++)(,*)/s;
    push @{$goterm{$id}}, $rest;
}

while (my $row2 = <SNPS>)
{
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} })
    {
        print("$row2$rest\n");
    }
}

close GOTERMS;
close SNPS;

Answer 1

Look at your output. It's clearly produced by the following:

For each line of the second file,
- for each line of the first file with the same id,
  - print out the combined line.

So the question is: how do you find the lines of the first file that have the same id as a line of the second file?

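The answer is: you store the lines of the first file in a hash indexed by the line's id. Each hash value is an array, because one id can have several lines.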

my %file1;
while (<$file1_fh>) {
   my ($id, $rest) = /^(\S++)(.*)/s;
   push @{ $file1{$id} }, $rest;
}

So the earlier pseudocode resolves to

while (my $row2 = <$file2_fh>) {
   chomp($row2);
   my ($id) = $row2 =~ /^(\S+)/;
   for my $rest (@{ $file1{$id} }) {
      print("$row2$rest");
   }
}
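
To see the two snippets working together, here is a small self-contained sketch that runs them against the sample data from the question. The in-memory filehandles and the `join_files` sub name are assumptions made for illustration; in a real script the handles would come from `open` on the actual files.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample data from the question, assumed to be tab-separated.
my $file1_data = join '',
    "contig11\tGO:100\tother columns of data\n",
    "contig11\tGO:289\tother columns of data\n",
    "contig11\tGO:113\tother columns of data\n",
    "contig22\tGO:388\tother columns of data\n",
    "contig22\tGO:101\tother columns of data\n";

my $file2_data = join '',
    "contig11\t3\tN\n",
    "contig11\t1\tY\n",
    "contig22\t1\tY\n",
    "contig22\t2\tN\n";

# Hypothetical helper: pairs every line of file 2 with every
# line of file 1 that has the same id.
sub join_files {
    my ($file1_fh, $file2_fh) = @_;

    # Index file 1 by id. Each value is an array of "rest of line"
    # strings, which keep their leading tab and trailing newline.
    my %file1;
    while (my $row1 = <$file1_fh>) {
        my ($id, $rest) = $row1 =~ /^(\S++)(.*)/s or next;
        push @{ $file1{$id} }, $rest;
    }

    my @out;
    while (my $row2 = <$file2_fh>) {
        chomp($row2);
        my ($id) = $row2 =~ /^(\S+)/;
        # Skip blank lines and ids that never appeared in file 1.
        next unless defined $id && exists $file1{$id};
        push @out, "$row2$_" for @{ $file1{$id} };
    }
    return @out;
}

# Open the strings as read-only in-memory files.
open(my $fh1, '<', \$file1_data) or die("Can't open sample file 1: $!\n");
open(my $fh2, '<', \$file2_data) or die("Can't open sample file 2: $!\n");

print join_files($fh1, $fh2);
```

Because `$rest` still carries the newline from the first file, the combined line needs no extra `"\n"` when printed.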

#!/usr/bin/env perl

use strict;
use warnings;

# Three-argument open: the file name is never interpreted as a mode.
open(my $GOTERMS, '<', $ARGV[0])
     or die("Error opening GO terms file \"$ARGV[0]\": $!\n");
open(my $SNPS, '<', $ARGV[1])
     or die("Error opening SNP file \"$ARGV[1]\": $!\n");

my %goterm;
while (<$GOTERMS>) {
    my ($id, $rest) = /^(\S++)(.*)/s;
    push @{ $goterm{$id} }, $rest;
}

while (my $row2 = <$SNPS>) {
    chomp($row2);
    my ($id) = $row2 =~ /^(\S+)/;
    for my $rest (@{ $goterm{$id} }) {
        print("$row2$rest");
    }
}
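
Comparing this script with the one in the question, the capture regex differs in a single character: the question has `(,*)` (zero or more commas) where this script has `(.*)` (zero or more of any character). A minimal check on one sample line, assuming tab-separated input as described in the question, suggests this is why everything from the GO column onwards went missing:

```perl
use strict;
use warnings;

# One line of the first file, assumed to be tab-separated.
my $line = "contig11\tGO:100\tother columns of data\n";

# As written in the question: ",*" matches zero or more literal commas.
# The id is followed by a tab, so the second capture is the empty string.
my ($id_q, $rest_q) = $line =~ /^(\S++)(,*)/s;

# As written in this answer: ".*" with /s matches everything after the id,
# including the trailing newline.
my ($id_a, $rest_a) = $line =~ /^(\S++)(.*)/s;

printf "question regex: rest has %d characters\n", length $rest_q;
printf "answer regex:   rest has %d characters\n", length $rest_a;
```

With the empty `$rest`, the question's `print("$row2$rest\n")` prints only the second file's row, once per matching line of the first file. The shebang also differs: `#!/usr/bin/env/perl` in the question versus `#!/usr/bin/env perl` here.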

Answer 2

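I'll describe how you can do this. You need to put each file into an array (each line is one array item). Then you just need to compare these arrays in the required way, which takes 2 loops. The main loop goes over each record of the array/file whose strings you will use as the base (in your example, that is the second file). Inside that loop you need another loop over each record of the array/file that holds the records you want to compare against. Then check each record of one array against each record of the other array and process the result.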

foreach my $record2 (@array2) {
    foreach my $record1 (@array1){
        if ($record2->{field} eq $record1->{field}){
            #here you need to create the string which you will show
            my $res_string = $record2->{field}.$record1->{field};
            print "$res_string\n";
        }
    }
}

Or don't use arrays: just read the files and compare each line of one with each line of the other. The general idea is the same :)
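
As a rough illustration of the nested-loop idea, the sketch below fills in the pseudocode with the sample data from the question. The `parse_line` and `combine` helpers and the `id`/`rest` field names are assumptions introduced for this example, and the lines are assumed to be tab-separated.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into its id (first column)
# and everything after the first tab.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($id, $rest) = split /\t/, $line, 2;
    return { id => $id, rest => $rest };
}

# Sample data from the question, one array item per line.
my @array1 = map { parse_line($_) } (
    "contig11\tGO:100\tother columns of data\n",
    "contig11\tGO:289\tother columns of data\n",
    "contig11\tGO:113\tother columns of data\n",
    "contig22\tGO:388\tother columns of data\n",
    "contig22\tGO:101\tother columns of data\n",
);
my @array2 = map { parse_line($_) } (
    "contig11\t3\tN\n",
    "contig11\t1\tY\n",
    "contig22\t1\tY\n",
    "contig22\t2\tN\n",
);

# Hypothetical helper: compare every record of the second array
# with every record of the first and keep the pairs whose ids match.
sub combine {
    my ($array1, $array2) = @_;
    my @out;
    for my $record2 (@$array2) {
        for my $record1 (@$array1) {
            if ($record2->{id} eq $record1->{id}) {
                push @out, join("\t", $record2->{id}, $record2->{rest}, $record1->{rest});
            }
        }
    }
    return @out;
}

print "$_\n" for combine(\@array1, \@array2);
```

This compares every pair of records, so the work grows with the product of the two file sizes. The hash lookup in Answer 1 avoids that by going straight to the matching lines.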

Perl script for combining 2 files with multiple entries
