Input file1:
col1 col2 col3 col4
ZGLP1 ICAM4 13.27 0.2425
ICAM4 ZGLP1 13.27 0.2425
RRP1B CDH24 20.8 1
ZGLP1 OOEP 18.79 0.3060
ZGLP1 RRP1B 39.62 0.2972
ZGLP1 CDH24 51.21 0.2560
BBCDI DND1 19.44 0.2833
BBCDI SOHLH2 36.61 0.2909
DND1 SOHLH2 18 0.8
Input file2:
chr8 18640000 18960000 ZGLP1 RRP1B CDH24 #gene number here is not fixed can be #4 #5 or more
chr8 19000000 19080000 BBCDI DND1 SOHLH2 #gene number here is not fixed can be #4 #5 or more
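I wrote code that compares col1 and col2 of file1 against every line of file2. If a pair appears anywhere on a file2 line, the program should print the chromosome, pos1, pos2, and the matching line from file1.
Output file: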
chr8 18640000 18960000 ZGLP1 RRP1B 39.62 0.2972
chr8 18640000 18960000 ZGLP1 CDH24 51.21 0.2560
chr8 18640000 18960000 RRP1B CDH24 20.8 1
chr8 19000000 19080000 BBCDI DND1 19.44 0.2833
chr8 19000000 19080000 BBCDI SOHLH2 36.61 0.2909
chr8 19000000 19080000 DND1 SOHLH2 18 0.8
This is what I have tried so far, but my input files are huge (2 GB), so it takes a very long time.
My Perl code:
open( AB, "file1" ) || die("cannot open");
open( BC, "file2" ) || die("cannot open");
open( OUT, ">output.txt" );
@file = <AB>;
chomp(@file);
@data = <BC>;
chomp(@data);
foreach $fl (@file) {
    if ( $fl =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)/ ) {
        $one = $1;
        $two = $2;
        $thr = $3;
        $for = $4;
    }
    foreach $line (@data) {
        if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) {
            $chr = $1;
            $pos1 = $2;
            $pos2 = $3;
        }
        if ( $line =~ /$one/ ) {
            if ( $line =~ /$two/ ) {
                print OUT $chr, "\t", $pos1, "\t", $pos2, "\t", $fl, "\n";
            }
        }
    }
}
Answer 0 (score: 1)
A few ways to speed up the code:
First, read in and parse file 1 and build an index:
my %ix;
while (<AB>) {
    # skip the first line (with the column headers)
    next if $. == 1;
    chomp;
    # assuming that the data is tab-separated; if not, you can run split /\s+/
    my @arr = split "\t";
    # create a hash with structure $ix{col1}{col2} = "col3 col4"
    $ix{ $arr[0] }{ $arr[1] } = $arr[2] . "\t" . $arr[3];
}
Now read file 2 one line at a time and look for matches:
while (<BC>) {
    chomp;
    # initialise a set of variables all at once
    # assumes the data is tab-delimited; if it isn't, use split /\s+/
    my ($chr, $pos1, $pos2, $g1, $g2, $g3) = split "\t";
    # $g1, $g2, and $g3 are the three IDs on the line. This code assumes exactly
    # three IDs per line, always in the order in which they appear in file 1.
    # look for $g1 in our index. if ( $ix{$g1} ) tests whether the entry is
    # true, which here means an inner hash exists for $g1.
    if ( $ix{$g1} ) {
        # now, for each of $g2 and $g3
        foreach my $g ($g2, $g3) {
            # ... check whether we've got an index entry where it is the second key
            if ( $ix{$g1}{$g} ) {
                # print out the data joined by tabs
                print OUT join("\t", $chr, $pos1, $pos2, $g1, $g, $ix{$g1}{$g}) . "\n";
            }
        }
    }
    # do the same check for $g2 and $g3. We have to check whether $ix{$g2} exists
    # first, because if we check $ix{$g2}{$g3} directly and $ix{$g2} DOESN'T exist,
    # Perl will create it. This is known as autovivification.
    if ($ix{$g2} && $ix{$g2}{$g3}) {
        print OUT join("\t", $chr, $pos1, $pos2, $g2, $g3, $ix{$g2}{$g3}) . "\n";
    }
}
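The loop above handles exactly three gene IDs per file 2 line, whereas the question notes that the number of genes is not fixed. The following is a minimal sketch of how the same index lookup might be extended to any number of genes. The sub names `build_index` and `matches_for_line` are hypothetical, the data is split on whitespace rather than tabs, every ordered pair of genes on a line is tried, and anything after a `#` on a file 2 line is treated as an annotation; all of these are assumptions, not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a two-level index from file 1: $ix->{col1}{col2} = "col3\tcol4".
sub build_index {
    my ($fh) = @_;
    my %ix;
    my $header = <$fh>;    # discard the column-header line
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $g1, $g2, @values ) = split ' ', $line;
        next unless defined $g2;    # skip blank or malformed lines
        $ix{$g1}{$g2} = join "\t", @values;
    }
    return \%ix;
}

# Return one tab-joined output row for every ordered pair of genes on a
# file 2 line that has an entry in the index.
sub matches_for_line {
    my ( $ix, $line ) = @_;
    $line =~ s/#.*//;    # assumption: '#' starts a trailing annotation
    my ( $chr, $pos1, $pos2, @genes ) = split ' ', $line;
    my @rows;
    for my $g1 (@genes) {
        # Fetch the inner hash once; reading only the top level
        # does not autovivify $ix->{$g1}.
        my $inner = $ix->{$g1} or next;
        for my $g2 (@genes) {
            next if $g1 eq $g2 || !exists $inner->{$g2};
            push @rows, join "\t", $chr, $pos1, $pos2, $g1, $g2, $inner->{$g2};
        }
    }
    return @rows;
}

# Demo on an in-memory copy of part of the sample data, so the sketch
# runs without any input files.
my $file1 = <<'END';
col1 col2 col3 col4
ZGLP1 ICAM4 13.27 0.2425
RRP1B CDH24 20.8 1
ZGLP1 RRP1B 39.62 0.2972
ZGLP1 CDH24 51.21 0.2560
END
open my $in, '<', \$file1 or die "cannot open in-memory file: $!";
my $index = build_index($in);
close $in;

print "$_\n"
    for matches_for_line( $index, 'chr8 18640000 18960000 ZGLP1 RRP1B CDH24' );
```

For the real data, the in-memory handle would be replaced by a handle opened on file 1, and `matches_for_line` would be called once per line while streaming through file 2, so that only the index is held in memory.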
Answer 1 (score: 1)
$ cat tst.awk
NR==FNR {
    if (NR>1)
        file1[$1,$2] = $0
    next
}
{
    for (i=3; i<=NF; i++)
        for (j=3; j<=NF; j++)
            if ( ($i,$j) in file1 )
                print $1, $2, $3, file1[$i,$j]
}
$
$ awk -f tst.awk file1 file2
chr8 18640000 18960000 ZGLP1 RRP1B 39.62 0.2972
chr8 18640000 18960000 ZGLP1 CDH24 51.21 0.2560
chr8 18640000 18960000 RRP1B CDH24 20.8 1
chr8 19000000 19080000 BBCDI DND1 19.44 0.2833
chr8 19000000 19080000 BBCDI SOHLH2 36.61 0.2909
chr8 19000000 19080000 DND1 SOHLH2 18 0.8