I have a set of tab-delimited files (up to 16). Each one looks like this:
gi|100816391|ref|NM_003934.1| 1 162 192
gi|104485445|ref|NM_138572.2| 7 2316 2376
gi|105554499|ref|NR_002791.2| 1 2792 2867
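Each file can contain up to 20 million lines. Some of the lines are unique; others are repeated many times. What I need to do is build a table that lists each unique line together with how often that line occurs in each file. Ideally the output would look like this: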
"Gene Name" \t "Read start" \t "alignstart" \t "alignend" \t "freq in file1" \t "freq in file2" \t etc.
gi|100816391|ref|NM_003934.1| \t 1 \t 162 \t 192 \t 10000 \t 200
gi|104485445|ref|NM_138572.2| \t 7 \t 2316 \t 2376 \t 2 \t 500
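etc.
I'm relatively new to programming and am trying to get up to speed as quickly as I can, focusing on Perl. I haven't found any posts close enough to what I'm doing that I could adapt them, but if you think this has been solved before, I'm happy to take pointers.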
Answer 0 (score: 0)
Try something like this to get you going:
File 1:
gi|100816391|ref|NM_003934.1| 1 162 192
gi|104485445|ref|NM_138572.2| 7 2316 2376
gi|105554499|ref|NR_002791.2| 1 2792 2867
File 2:
gi|100816391|ref|NM_003934.1| 1 162 192 # The same as in file 1
gi|104485445|ref|NM_111111.2| 7 2316 2376 # Different from file 1
gi|105554499|ref|NR_222222.2| 1 2792 2867 # Different from file 1
Code:
#!/usr/bin/perl
use warnings;
use strict;

# %seen counts identical lines across both files; %file1 and %file2 keep the
# last line seen for each gene name in the respective file.
my (%file1, %file2, %seen);

open my $input, '<', 'in.txt' or die "Can't open in.txt: $!";
while (<$input>) {
    chomp;
    my @split = split(/\t/);
    $file1{$split[0]} = $_;
    $seen{$_}++;    # Count each time an identical line is seen
}
close $input;

open my $input2, '<', 'in.2.txt' or die "Can't open in.2.txt: $!";
while (<$input2>) {
    chomp;
    my @split = split(/\t/);
    $file2{$split[0]} = $_;
    $seen{$_}++;
}
close $input2;

# Print every distinct line with its total frequency across both files
foreach my $key (keys %seen) {
    print "$key\tfreq: $seen{$key}\n";
}
Output:
gi|105554499|ref|NR_222222.2| 1 2792 2867 freq: 1
gi|100816391|ref|NM_003934.1| 1 162 192 freq: 2
gi|105554499|ref|NR_002791.2| 1 2792 2867 freq: 1
gi|104485445|ref|NM_111111.2| 7 2316 2376 freq: 1
gi|104485445|ref|NM_138572.2| 7 2316 2376 freq: 1
Answer 1 (score: 0)
You can do this with awk:
awk '{a[$0]++}END{for (i in a){print i,a[i]}}' yourfile
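As each line is read, the element of array a[] indexed by that line is incremented, which counts the occurrences of that line. Then, at the end, the keys and contents of a[] are printed.
So, after the first line, array a[] will look like this: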
a["gi|100816391|ref|NM_003934.1| 1 162 192"]=1
After the second line, array a[] will additionally contain:
a["gi|104485445|ref|NM_138572.2| 7 2316 2376"]=1
If you have 16 files to process, put the awk command in a loop:
#!/usr/bin/bash
for f in *.csv
do
    echo "Processing file $f"
    awk '{a[$0]++}END{for (i in a){print i,a[i]}}' "$f"
done
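Since the question focuses on Perl, the counting idiom behind the awk one-liner can be sketched with a Perl hash in place of the awk array. This is only an illustration under stated assumptions: count_lines is a made-up helper name, and the sample lines are copied from the question with the first one repeated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_lines is a hypothetical helper name, not part of the answer above.
# It takes a list of lines and returns a hash ref mapping line => count,
# mirroring what a[$0]++ does in the awk one-liner.
sub count_lines {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;
    return \%count;
}

# Sample lines taken from the question; the first one is repeated.
my @lines = (
    "gi|100816391|ref|NM_003934.1|\t1\t162\t192",
    "gi|104485445|ref|NM_138572.2|\t7\t2316\t2376",
    "gi|100816391|ref|NM_003934.1|\t1\t162\t192",
);

my $count = count_lines(@lines);

# As with awk's "for (i in a)", hash iteration order is unspecified,
# so the keys are sorted here to get a stable listing.
for my $line (sort keys %$count) {
    print "$line\t$count->{$line}\n";
}
```

Like the awk version, this counts within a single list of lines; it does not yet separate the counts by file.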
Answer 2 (score: 0)
Assume the contents of the files are as follows (two files here):
my %files = (
    file1 => [
        'gi|100816391|ref|NM_003934.1| 1 162 192',
        'gi|104485445|ref|NM_138572.2| 7 2316 2376',
        'gi|105554499|ref|NR_002791.2| 1 2792 2867',
        'gi|100816391|ref|NM_003934.1| 1 162 192',
        'gi|104485445|ref|NM_138572.2| 7 2316 2376',
    ],
    file2 => [
        'gi|104485445|ref|NM_138572.2| 7 2316 2376',
        'gi|105554499|ref|NR_002791.2| 1 2792 2867',
        'gi|105554499|ref|NR_002791.2| 1 2792 2867',
        'gi|104485445|ref|NM_138572.2| 7 2316 2376',
    ]
);
The script:
my %data;
# Here you have to loop on all your files
# and do open ... while() ... instead of this foreach loop
foreach my $file (keys %files) {
    foreach (@{$files{$file}}) {
        $data{$_}{$file}++;    # line => file => count
    }
}
foreach my $data (keys(%data)) {
    my $freq = $data;
    foreach my $file (sort keys %files) {
        # A file in which the line never occurs has no entry, so report 0
        $freq .= "\t$file:" . (exists $data{$data}{$file} ? $data{$data}{$file} : 0);
    }
    print $freq, "\n";
}
Output:
gi|105554499|ref|NR_002791.2| 1 2792 2867 file1:1 file2:2
gi|100816391|ref|NM_003934.1| 1 162 192 file1:2 file2:0
gi|104485445|ref|NM_138572.2| 7 2316 2376 file1:2 file2:2
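The comment in the script says to replace the in-memory %files structure with an open ... while loop over the real files. A minimal sketch of that step might look like the following. It assumes the input files are passed on the command line, and count_per_file is a hypothetical helper name introduced only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_per_file is a hypothetical helper: given a list of file paths, it
# returns a hash ref of the form line => { path => count }.
sub count_per_file {
    my @paths = @_;
    my %data;
    for my $path (@paths) {
        open my $fh, '<', $path or die "Can't open $path: $!";
        while (my $line = <$fh>) {
            chomp $line;
            $data{$line}{$path}++;
        }
        close $fh;
    }
    return \%data;
}

# Assumption: the input files are given on the command line.
my @paths = @ARGV;
my $data  = count_per_file(@paths);

for my $line (keys %$data) {
    my @fields = ($line);
    for my $path (sort @paths) {
        # Files in which the line never occurred have no entry, so default to 0.
        push @fields, "$path:" . ($data->{$line}{$path} // 0);
    }
    print join("\t", @fields), "\n";
}
```

Note that every distinct line is kept in memory as a hash key, so with many large files the memory footprint grows with the number of distinct lines.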
Answer 3 (score: 0)
M42's answer was the one I found easiest to understand and modify; I'll let people with real programming experience say whether it is actually the best approach. In any case, I modified his program slightly to fit my situation. The final program that worked is:
use strict;
use warnings;

my $sourcefolder    = "/home/guests/etc";
my $destfolder      = "/home/guests/etc";
my $sourceextension = "fwd";    # the extension of the files I want to process
my $outfile         = "combine_sum-out";
my (%data, @samples);

opendir(my $dir, $sourcefolder) or die "Cannot open directory $sourcefolder: $!";
while (my $filename = readdir($dir)) {
    # Only process files whose name ends in .$sourceextension
    next unless $filename =~ /\.\Q$sourceextension\E$/;
    print "Now processing: $filename\n";
    my $sample = (split /\./, $filename)[0];    # this is to get rid of the extension on the source files
    push @samples, $sample;
    open(my $in, '<', "$sourcefolder/$filename") or die "Can't open $filename: $!\n";
    while (my $line = <$in>) {
        chomp $line;
        $data{$line}{$sample}++;    # creates the hash of a hash: line => sample => count
    }
    close $in;
}
closedir $dir;

open(my $out, '>>', "$destfolder/$outfile") or die "Can't write to $outfile: $!\n";
foreach my $data (keys(%data)) {
    my $freq = $data;
    foreach my $sa (@samples) {
        $freq .= "\t$sa:" . (exists $data{$data}{$sa} ? $data{$data}{$sa} : 0);
    }
    print $out $freq, "\n";
}
close $out;
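I will probably end up modifying the last block so that only the values from $data{$data}{$sa} are printed, with the original $data printed as a header line at the start.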
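One way that change to the last block could look is sketched below. It rests on assumptions: format_table is a hypothetical helper name, the header line is interpreted as the column titles from the question followed by one column per sample, and the example counts are taken from the file1/file2 output shown in the earlier answer.

```perl
use strict;
use warnings;

# format_table is a hypothetical helper. $data is a hash ref of the form
# line => { sample => count }; $samples is an array ref of sample names.
# It returns a header row followed by one row per distinct line, holding
# only the counts (no "sample:" prefix on each value).
sub format_table {
    my ($data, $samples) = @_;
    my @rows = join("\t", "Gene Name", "Read start", "alignstart", "alignend", @$samples);
    for my $line (sort keys %$data) {
        my @fields = ($line);
        for my $sample (@$samples) {
            # Samples in which the line never occurred have no entry; report 0.
            push @fields, ($data->{$line}{$sample} // 0);
        }
        push @rows, join("\t", @fields);
    }
    return @rows;
}

# Example counts taken from the file1/file2 output in the earlier answer.
my %data = (
    "gi|100816391|ref|NM_003934.1|\t1\t162\t192"   => { file1 => 2 },
    "gi|104485445|ref|NM_138572.2|\t7\t2316\t2376" => { file1 => 2, file2 => 2 },
);
print "$_\n" for format_table(\%data, [ 'file1', 'file2' ]);
```

Because each input line already contains the gene name and three coordinates separated by tabs, appending the counts with tabs yields the column layout asked for in the question.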
Thanks, everyone, for the help!