Question

I have 10 folders, and in each folder I have two files (CSV, comma-separated) in the following format.

File 1:

Ensembl Gene ID,Ensembl Transcript ID,Exon Chr Start (bp),Exon Chr End (bp),Exon Rank in Transcript, Transcript count,Gene End (bp) ,Gene Start (bp),Strand
ENSG00000271782,ENST00000607815,50902700,50902978,1,1,50902978,50902700,-1
ENSG00000232753,ENST00000424955,103817769,103817825,1,1,103828355,103817769,1
ENSG00000232753,ENST00000424955,103827995,103828355,2,1,103828355,103817769,1
ENSG00000225767,ENST00000424664,50927141,50927168,1,1,50936822,50927141,1

File 2:

number,Start pos,End Pos
1,41035,41048
3,36738,36751
3,38169,38182
3,40264,40277

I am trying to match the second file against the first file.

The number in column 1 of the second file is the key record number of the corresponding record in the first file.

The desired output is:

1,ENSG00000271782,41035,41048,50902978,50902700,-1
3,ENSG00000225767,36738,36751,50936822,50927141,1
3,ENSG00000225767,38169,38182,50936822,50927141,1
3,ENSG00000225767,40264,40277,50936822,50927141,1

I have started by reading the second file with Text::CSV, but I need help.

use strict;
use warnings;
use lib 'C:/Perl/lib';
use Text::CSV;

my $file1 = "infile1";
open my $fh1, "<", $file1 or die "$file1: $!";
my $file2 = "infile2";
open my $fh2, "<", $file2 or die "$file2: $!";

my $csv = Text::CSV->new ({
  binary    => 1, 
  auto_diag => 1,
  });


while (my $row = $csv->getline ($fh2)) {
  print "@$row\n"; # I am stuck in extraction ? do I need to put another while loop for fh1  
  }

close $fh1;
close $fh2;

Answer 1

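An interesting part of this problem is that you need logic to read file 1 until it gets ahead of file 2, logic to read file 2 until it gets ahead of file 1, and logic that knows how to act when one is behind the other and when they are in step.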

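You need to keep track of the unique Ensembl gene IDs in a list, along with their ordinal positions. That way, when you read the second line of file2, you know to skip the second and third lines of file1, and when you read the third and fourth lines of file2, you also know not to skip any more of file1.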
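As a rough sketch of that single-pass idea (not a definitive implementation): the helper name merge_streams is made up for illustration, and the code assumes both files start with a header line, that file2 is sorted by its number column, and that the gene-level columns sit at positions 6 to 8 as in the sample. The sample rows are embedded as strings so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: merge two already-open filehandles in one pass.
# Assumes a header line in each file and file 2 sorted by its number column.
sub merge_streams {
    my ( $fh1, $fh2 ) = @_;
    my @out;

    my $header1 = <$fh1>;    # discard header of file 1
    my $header2 = <$fh2>;    # discard header of file 2

    my $ordinal = 0;     # ordinal of the gene currently held from file 1
    my $last_id = '';    # gene ID of the last line read from file 1
    my @current;         # fields of the first line of the current gene

    while ( my $line2 = <$fh2> ) {
        chomp $line2;
        my ( $num, $start, $end ) = split /,/, $line2;

        # File 1 is behind: read forward until its gene ordinal catches up.
        while ( $ordinal < $num ) {
            my $line1 = <$fh1>;
            last unless defined $line1;    # file 1 is exhausted
            chomp $line1;
            my @f = split /,/, $line1;
            next if $f[0] eq $last_id;     # same gene again, no new ordinal
            $last_id = $f[0];
            $ordinal++;
            @current = @f;
        }

        # In step: emit the joined line. Otherwise file 1 ran out early.
        push @out, join( ',', $num, $current[0], $start, $end, @current[ 6 .. 8 ] )
            if $ordinal == $num;
    }
    return @out;
}

# Sample data modelled on the question, read through in-memory filehandles.
my $file1_data = <<'END';
Ensembl Gene ID,Ensembl Transcript ID,Exon Chr Start (bp),Exon Chr End (bp),Exon Rank in Transcript, Transcript count,Gene End (bp) ,Gene Start (bp),Strand
ENSG00000271782,ENST00000607815,50902700,50902978,1,1,50902978,50902700,-1
ENSG00000232753,ENST00000424955,103817769,103817825,1,1,103828355,103817769,1
ENSG00000232753,ENST00000424955,103827995,103828355,2,1,103828355,103817769,1
ENSG00000225767,ENST00000424664,50927141,50927168,1,1,50936822,50927141,1
END
my $file2_data = <<'END';
number,Start pos,End Pos
1,41035,41048
3,36738,36751
3,38169,38182
3,40264,40277
END

open my $in1, '<', \$file1_data or die "in-memory open failed: $!";
open my $in2, '<', \$file2_data or die "in-memory open failed: $!";
print "$_\n" for merge_streams( $in1, $in2 );
```

The advantage of this shape is that file1 never has to fit in memory; the cost is the sorted-input assumption, which breaks as soon as file2 refers back to an earlier record number.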

Alternatively, you could read file1 into memory and build an array of arrays of lines, so that, for example

    $file1arr[1] = [ $line1 ];
    $file1arr[2] = [ $line2, $line3 ];
    $file1arr[3] = [ $line4 ];

So as you iterate over file2, all the lines from file1 sit in a neat little array at the array index that corresponds to file2's number column.

Then it is just a matter of iterating over the array of file1 lines, splitting them and building the output lines.
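A minimal sketch of the in-memory variant might look like the following. The helper names group_by_ordinal and join_files are invented here, and the sketch assumes that the gene-level columns (gene end, gene start, strand) are identical on every line of a gene, as they are in the sample, so only the first line of each group is split.

```perl
use strict;
use warnings;

# Hypothetical helper: group the data lines of file 1 by gene ordinal.
# $groups->[N] holds all raw lines of the Nth distinct gene ID; index 0
# stays unused so that indices line up with file 2's number column.
sub group_by_ordinal {
    my ($fh1) = @_;
    my ( @groups, %ordinal_of );
    my $count  = 0;
    my $header = <$fh1>;    # discard the header line
    while ( my $line = <$fh1> ) {
        chomp $line;
        my ($gene_id) = split /,/, $line;
        $ordinal_of{$gene_id} = ++$count unless exists $ordinal_of{$gene_id};
        push @{ $groups[ $ordinal_of{$gene_id} ] }, $line;
    }
    return \@groups;
}

# Hypothetical helper: build one output line per data line of file 2.
sub join_files {
    my ( $groups, $fh2 ) = @_;
    my @out;
    my $header = <$fh2>;    # discard the header line
    while ( my $line = <$fh2> ) {
        chomp $line;
        my ( $num, $start, $end ) = split /,/, $line;
        my $lines = $groups->[$num] or next;    # no gene with that ordinal
        my @f = split /,/, $lines->[0];         # first line of the group
        push @out, join( ',', $num, $f[0], $start, $end, @f[ 6 .. 8 ] );
    }
    return @out;
}

# Sample data modelled on the question, read through in-memory filehandles.
my $file1_text = <<'END';
Ensembl Gene ID,Ensembl Transcript ID,Exon Chr Start (bp),Exon Chr End (bp),Exon Rank in Transcript, Transcript count,Gene End (bp) ,Gene Start (bp),Strand
ENSG00000271782,ENST00000607815,50902700,50902978,1,1,50902978,50902700,-1
ENSG00000232753,ENST00000424955,103817769,103817825,1,1,103828355,103817769,1
ENSG00000232753,ENST00000424955,103827995,103828355,2,1,103828355,103817769,1
ENSG00000225767,ENST00000424664,50927141,50927168,1,1,50936822,50927141,1
END
my $file2_text = <<'END';
number,Start pos,End Pos
1,41035,41048
3,36738,36751
3,38169,38182
3,40264,40277
END

open my $src1, '<', \$file1_text or die "in-memory open failed: $!";
open my $src2, '<', \$file2_text or die "in-memory open failed: $!";
print "$_\n" for join_files( group_by_ordinal($src1), $src2 );
```

Unlike a single-pass merge, this does not care about the order of file2, at the price of holding all of file1 in memory.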

Answer 2

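Since there are no commas inside double-quoted fields, you can split on commas instead of using Text::CSV (which is an excellent module). Given that, the following produces your desired output: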

use strict;
use warnings;
use autodie;

my ( $num,   %hash )  = 0;
my ( $file1, $file2 ) = qw/inFile1 inFile2/;

open my $fh1, '<', $file1;
while (<$fh1>) {
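    next if $. == 1;    # skip the header line
    chomp;
    my @fields = split /,/;

    # Bump the record number the first time a gene ID is seen.
    $num++ if !$hash{ $fields[0] }++;
    # Store gene ID, gene end, gene start and strand under that number.
    push @{ $hash{$num} }, [ @fields[ 0, 6 .. 8 ] ];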
}
close $fh1;

open my $fh2, '<', $file2;
while (<$fh2>) {
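    next if $. == 1;    # skip the header line ($. restarts after close $fh1)
    chomp;
    my @fields = split /,/;

    # Look up the first stored row for this record number.
    if ( my @arr = @{ $hash{ $fields[0] }->[0] } ) {
        # Insert the start and end positions right after the gene ID.
        splice @arr, 1, 0, @fields[ 1, 2 ];
        print join( ',', $fields[0], @arr ), "\n";
    }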
}
close $fh2;

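This uses a hash to: 1) keep track of the gene IDs already seen, and 2) build a hash of arrays of arrays (HoAoA). The count (your "key record") is incremented on unique gene IDs, so #1 keeps track of those IDs to make sure $num is incremented only when a gene ID has not been seen before. Number 2 (the HoAoA) is used because there are multiple instances of the same gene ID, although only the values of the first instance are used when printing. (I did notice, though, that the second file skips #2, which is the multi-instance gene ID.) Perhaps you need only a hash of arrays (HoA), but it works well the way it is, or you can simply modify it as needed. That is, the code can be simplified if you are not going to use the information from the repeated gene IDs.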

Output on your dataset:

1,ENSG00000271782,41035,41048,50902978,50902700,-1
3,ENSG00000225767,36738,36751,50936822,50927141,1
3,ENSG00000225767,38169,38182,50936822,50927141,1
3,ENSG00000225767,40264,40277,50936822,50927141,1
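If the repeated gene rows really are not needed, the simplification mentioned above could be sketched roughly as follows. This is one possible reading, not the code that produced the output shown: build_index and format_row are invented names, the seen-IDs bookkeeping and the lookup table are kept in two separate hashes, and only the first row of each gene is stored.

```perl
use strict;
use warnings;

# Hypothetical helper: map record number => [gene ID, gene end, gene start, strand],
# keeping only the first row seen for each gene ID.
sub build_index {
    my ($fh1) = @_;
    my ( %seen, %by_num );
    my $num    = 0;
    my $header = <$fh1>;    # discard the header line
    while ( my $line = <$fh1> ) {
        chomp $line;
        my @f = split /,/, $line;
        next if $seen{ $f[0] }++;    # later rows of a known gene are ignored
        $by_num{ ++$num } = [ @f[ 0, 6 .. 8 ] ];
    }
    return \%by_num;
}

# Hypothetical helper: turn one data line of file 2 into an output line,
# or return nothing when the record number is unknown.
sub format_row {
    my ( $by_num, $line2 ) = @_;
    my ( $num, $start, $end ) = split /,/, $line2;
    my $gene = $by_num->{$num} or return;
    my ( $id, @rest ) = @$gene;
    return join( ',', $num, $id, $start, $end, @rest );
}

# Sample data modelled on the question, read through an in-memory filehandle.
my $genes_text = <<'END';
Ensembl Gene ID,Ensembl Transcript ID,Exon Chr Start (bp),Exon Chr End (bp),Exon Rank in Transcript, Transcript count,Gene End (bp) ,Gene Start (bp),Strand
ENSG00000271782,ENST00000607815,50902700,50902978,1,1,50902978,50902700,-1
ENSG00000232753,ENST00000424955,103817769,103817825,1,1,103828355,103817769,1
ENSG00000232753,ENST00000424955,103827995,103828355,2,1,103828355,103817769,1
ENSG00000225767,ENST00000424664,50927141,50927168,1,1,50936822,50927141,1
END

open my $genes_fh, '<', \$genes_text or die "in-memory open failed: $!";
my $index = build_index($genes_fh);

for my $row ( '1,41035,41048', '3,36738,36751', '3,38169,38182', '3,40264,40277' ) {
    my $out = format_row( $index, $row );
    print "$out\n" if defined $out;
}
```

A record number that does not exist in file1 is silently skipped here; whether that or an error is the right behaviour depends on your data.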

Hope this helps!

Parsing CSV files and matching
