How to combine files, showing the frequency of each entry

Asked: 2014-02-11 11:40:12

Tags: perl merge frequency

I have a series of tab-delimited files (up to 16). Each looks like:

gi|100816391|ref|NM_003934.1|   1   162 192

gi|104485445|ref|NM_138572.2|   7   2316    2376

gi|105554499|ref|NR_002791.2|   1   2792    2867

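Each file can contain up to 20 million lines. Some of the lines are unique; some are repeated many times. What I need to do is create a table that lists each unique line along with how often that line occurs in each file. Ideally the output would look like this: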

"Gene Name" \t "Read start" \t "alignstart" \t "alignend" \t "freq in file1" \t "freq in file2" \t etc.

gi|100816391|ref|NM_003934.1| \t 1 \t 162 \t 192 \t 10000 \t 200

gi|104485445|ref|NM_138572.2| \t 7 \t 2316 \t 2376 \t 2 \t 500

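I'm relatively new to programming and I'm trying to get up to speed as quickly as I can, focusing on Perl. I haven't seen any posts close enough to what I'm doing that I think I could adapt them, but if you think this has been solved before, I'm happy to take suggestions.

4 Answers:

Answer 0: (score: 0)

Try something like this to get you going:

File 1: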

gi|100816391|ref|NM_003934.1|   1       162     192
gi|104485445|ref|NM_138572.2|   7       2316    2376
gi|105554499|ref|NR_002791.2|   1       2792    2867

File 2:

gi|100816391|ref|NM_003934.1|   1       162     192 # The same as in file 1
gi|104485445|ref|NM_111111.2|   7       2316    2376 # Different from file 1
gi|105554499|ref|NR_222222.2|   1       2792    2867 # Different from file 1

Code:

#!/usr/bin/perl
use warnings;
use strict;

my %seen;

open my $input, '<', 'in.txt' or die "Can't open in.txt: $!";
while (<$input>){
    chomp;
    $seen{$_}++; # Count each time you see an identical line in file
}
close $input;

open my $input2, '<', 'in.2.txt' or die "Can't open in.2.txt: $!";
while (<$input2>){
    chomp;
    $seen{$_}++; # Same tally, so the counts are totals across both files
}
close $input2;

foreach my $key (keys %seen){
    print "$key\tfreq: $seen{$key}\n"; # Print out all lines with their frequency of occurrence
}

Output:

gi|105554499|ref|NR_222222.2|   1   2792    2867    freq: 1
gi|100816391|ref|NM_003934.1|   1   162 192 freq: 2
gi|105554499|ref|NR_002791.2|   1   2792    2867    freq: 1
gi|104485445|ref|NM_111111.2|   7   2316    2376    freq: 1
gi|104485445|ref|NM_138572.2|   7   2316    2376    freq: 1

Answer 1: (score: 0)

You can do this with awk:

awk '{a[$0]++}END{for (i in a){print i,a[i]}}' yourfile

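As each line is read, the element of the array a[] indexed by that line is incremented, which counts the occurrences of that line. Then, at the end, the keys and contents of a[] are printed.

So, after the first line, the array a[] will look like this: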

a["gi|100816391|ref|NM_003934.1|   1   162 192"]=1

After the second line, the array a[] will additionally contain:

a["gi|104485445|ref|NM_138572.2|   7   2316    2376"]=1

If you want to do this for all 16 files, put the above in a loop:

#!/usr/bin/bash
for f in *.csv
do
  echo Processing file "$f"
  awk '{a[$0]++}END{for (i in a){print i,a[i]}}' "$f"
done
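
For anyone who would rather stay in Perl, the counting idiom described in this answer can be sketched with a hash in place of the awk array. This is only an illustration of the technique, not part of the original answer; the helper name count_lines and the sample lines are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each distinct line occurs, like awk's a[$0]++.
# count_lines is a hypothetical helper name used only in this sketch.
sub count_lines {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;    # a missing key starts at undef and becomes 1
    return \%count;
}

# Hypothetical sample input: three lines, one of them repeated.
my @sample = (
    "gi|100816391|ref|NM_003934.1|\t1\t162\t192",
    "gi|104485445|ref|NM_138572.2|\t7\t2316\t2376",
    "gi|100816391|ref|NM_003934.1|\t1\t162\t192",
);

my $count = count_lines(@sample);
print "$_\t$count->{$_}\n" for sort keys %$count;
```

Like the awk loop, this tallies one input at a time, so it does not by itself put the counts from different files side by side.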

Answer 2: (score: 0)

Assuming the contents of the files are (here there are 2 files):

my %files = (
    file1 => [
        'gi|100816391|ref|NM_003934.1|   1   162 192',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|100816391|ref|NM_003934.1|   1   162 192',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
    ],
    file2 => [
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|105554499|ref|NR_002791.2|   1   2792    2867',
        'gi|104485445|ref|NM_138572.2|   7   2316    2376',
    ]
);

A piece of script:

my %data;
# Here you have to loop on all your files
# and do open ... while() ... instead of this foreach loop
foreach my $file (keys %files) {
    foreach (@{$files{$file}}) {
        $data{$_}{$file}++;
    }
}
foreach my $data (keys(%data)) {
    my $freq = $data;
    foreach my $file (sort keys %files) {
        $freq .= "\t$file:" . (exists$data{$data}{$file} ? $data{$data}{$file} : 0);
    }
    print $freq,"\n";
}

Output:

gi|105554499|ref|NR_002791.2|   1   2792    2867    file1:1 file2:2
gi|100816391|ref|NM_003934.1|   1   162 192 file1:2 file2:0
gi|104485445|ref|NM_138572.2|   7   2316    2376    file1:2 file2:2
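
The comment in the script says to replace the in-memory foreach with a loop that opens each file and reads it line by line. A minimal sketch of that step follows, assuming the file names are passed on the command line and that Perl 5.10 or later is available (for the // operator). The names count_per_file and format_rows are hypothetical helpers, not part of the answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build line => file => count by reading each file in turn.
# count_per_file is a hypothetical helper name for this sketch.
sub count_per_file {
    my @paths = @_;
    my %data;
    foreach my $path (@paths) {
        open(my $in, '<', $path) or die "Can't open $path: $!\n";
        while (my $line = <$in>) {
            chomp $line;
            $data{$line}{$path}++;
        }
        close $in;
    }
    return \%data;
}

# One output row per distinct line, with 0 for files that do not contain it.
sub format_rows {
    my ($data, @paths) = @_;
    my @rows;
    foreach my $line (sort keys %$data) {
        my $row = $line;
        foreach my $path (@paths) {
            $row .= "\t$path:" . ($data->{$line}{$path} // 0);
        }
        push @rows, $row;
    }
    return @rows;
}

# Assumed usage: perl count.pl file1 file2 ...
if (@ARGV) {
    my $data = count_per_file(@ARGV);
    print "$_\n" for format_rows($data, @ARGV);
}
```

Every distinct line is held in memory as a hash key, so with inputs of the size mentioned in the question the memory use is worth checking on a small subset first.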

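Answer 3: (score: 0)

M42's answer was the one I found easiest to understand and modify; I'll let people with real programming experience say whether it is actually the best approach. In any case, I modified his program slightly to fit my situation. The final program that worked is: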

use strict;
use warnings;

my $sourcefolder    = "/home/guests/etc";
my $destfolder      = "/home/guests/etc";
my $sourceextension = "fwd";    # the extension of the files I want to change
my $outfile         = "combine_sum-out";

my (%data, @samples);

opendir(my $dir, $sourcefolder) or die "Cannot open directory $sourcefolder: $!";
while (defined(my $filename = readdir($dir))) {
    # only process files whose name ends in .$sourceextension
    next unless $filename =~ /\.\Q$sourceextension\E$/;
    print "Now processing: $filename\n";
    my $sample = (split /\./, $filename)[0];    # this is to get rid of the extension on the source files
    push @samples, $sample;

    open(my $in, '<', "$sourcefolder/$filename") or die "Can't open $filename: $!\n";
    while (my $line = <$in>) {
        chomp $line;
        $data{$line}{$sample}++;    # creates the hash of a hash
    }
    close $in;
}
closedir $dir;

# open the output once, after all inputs have been counted
open(my $out, '>>', "$destfolder/$outfile") or die "Can't write to $outfile: $!\n";
foreach my $line (keys %data) {
    my $freq = $line;
    foreach my $sa (@samples) {
        $freq .= "\t$sa:" . (exists $data{$line}{$sa} ? $data{$line}{$sa} : 0);
    }
    print $out $freq, "\n";
}
close $out;

I may eventually modify the last block so that only the values from $data{$data}{$sa} are printed, with the original $data printed as a header at the beginning.
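
One way that change could look is sketched below: a single header row, then each line followed by bare counts. This is a guess at the intended layout rather than the code that was actually used; the first four column titles are taken from the question, while format_table, the "freq in ..." labels, and the sample names s1 and s2 are assumptions made for the example. It relies on Perl 5.10 or later for the // operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn line => sample => count into table rows: one header row, then one
# row per distinct line with a bare count for every sample.
# format_table is a hypothetical helper name for this sketch.
sub format_table {
    my ($data, @samples) = @_;
    my @rows = (
        join("\t", "Gene Name", "Read start", "alignstart", "alignend",
             map { "freq in $_" } @samples)
    );
    foreach my $line (sort keys %$data) {
        my @counts = map { $data->{$line}{$_} // 0 } @samples;   # 0 if absent
        push @rows, join("\t", $line, @counts);
    }
    return @rows;
}

# Hypothetical counts for two samples named s1 and s2.
my %data = (
    "gi|100816391|ref|NM_003934.1|\t1\t162\t192"   => { s1 => 2 },
    "gi|104485445|ref|NM_138572.2|\t7\t2316\t2376" => { s1 => 2, s2 => 2 },
);
print "$_\n" for format_table(\%data, "s1", "s2");
```

Because the sample names appear only in the header, every data row has the same number of tab-separated fields, which should make the file easier to load into a spreadsheet or R.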

Thanks, everyone, for the help!