Question

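I am using grep from my Perl script, and I am trying to grep for the exact keyword that I give it. The problem is that "-w" treats the "-" character as a word separator, so a keyword also matches names that merely start with it.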

Example: say I have these two records:

A1BG    0.0767377011073753
A1BG-AS1    0.233775553296782

If I run grep -w "A1BG", it returns both of them, but I only want the exact one.

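Any suggestions? Many thanks in advance.

PS.

Here is my whole code. The input file is tab-delimited with two columns. I want to keep a single value per gene: if a gene has more than one record, I compute the average.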

#!/usr/bin/perl
use strict;
use warnings;

#Find the average fc between common genes
sub avg {
my $total;
$total += $_ foreach @_;
   return $total / @_;
}

my @mykeys = `cat G13_T.txt| awk '{print \$1}'| sort -u`;
foreach (@mykeys)
{
    my @TSS = ();

    my $op1 = 0;

    my $key = $_;
    chomp($key);
    #print "$key\n";
    my $command = "cat G13_T.txt|grep -E '([[:space:]]|^)$key([[:space:]]|\$)'";
    #my $command = "cat Unique_Genes/G13_T.txt|grep -w $key";
    my @belongs= `$command`;
    chomp(@belongs);
    my $count = scalar(@belongs);
    if ($count == 1) {
            print "$belongs[0]\n";
    }
    else {
            for (my $i = 0; $i < $count; $i++) {
                    my @token = split('\t', $belongs[$i]);
                    my $lfc = $token[1];
                    push (@TSS, $lfc);
            }
            $op1 = avg(@TSS);
            print $key ."\t". $op1. "\n";
    }
}

Answer 1

You can use a POSIX ERE regex with grep, like this:

grep -E '([[:space:]]|^)A1BG([[:space:]]|$)' file

To return only the matches (not the matching lines):

grep -Eo '([[:space:]]|^)A1BG([[:space:]]|$)' file

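Details

([[:space:]]|^) - Group 1: a whitespace character or the start of the line
A1BG - the literal substring
([[:space:]]|$) - Group 2: a whitespace character or the end of the line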
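
As a quick check of the difference between the two patterns, here is a minimal shell sketch. It assumes a grep that supports -w (GNU and BSD grep do) and that mktemp is available. The temporary file is only a stand-in for the real input and holds the two sample records from the question.

```shell
# Recreate the two sample records in a temporary, tab-separated file.
tmp=$(mktemp)
printf 'A1BG\t0.0767377011073753\nA1BG-AS1\t0.233775553296782\n' > "$tmp"

# -w treats "-" as a non-word character, so both records count as matches.
word_matches=$(grep -cw 'A1BG' "$tmp")

# Requiring whitespace or a line boundary on both sides keeps only the exact key.
exact_matches=$(grep -cE '([[:space:]]|^)A1BG([[:space:]]|$)' "$tmp")

echo "grep -w: $word_matches line(s); anchored ERE: $exact_matches line(s)"
rm -f "$tmp"
```

With the sample data, the -w count covers both records and the anchored pattern counts only the A1BG line. If the keyword comes from a variable and may contain regex metacharacters such as ".", it would need to be escaped before being placed in the pattern.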

Answer 2

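If I understood the clarification in the comments correctly, the goal is to find the average of the values (second column) for each unique name in the first column. There is no need for external tools for that.

Read the file line by line and add up the values for each name. Name uniqueness is guaranteed by using a hash, with the names as keys. Along with the sum, also keep track of the count for each name.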

use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 filename\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my %results;

while (<$fh>) {
    #my ($name, $value) = split /\t/;
    my ($name, $value) = split /\s+/;  # used for easier testing

    $results{$name}{value} += $value;
    ++$results{$name}{count};
}

foreach my $name (sort keys %results) { 
    $results{$name}{value} /= $results{$name}{count} 
        if $results{$name}{count} > 1;

    say "$name => $results{$name}{value}";
}

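After the file is processed, each accumulated value is divided by its count and overwritten with the result, so it ends up holding the average (/= divides and assigns). The division is done only if the count is > 1, as a small efficiency measure.

If it is of any use to know all the values found for each name, store them in an arrayref for each key instead of adding them up: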

while (<$fh>) {
    #my ($name, $value) = split /\t/;
    my ($name, $value) = split /\s+/;  # used for easier testing

    push @{$results{$name}}, $value;
}

Now we don't need the count, since it is given by the number of elements in the array(ref):

use List::Util qw(sum);

foreach my $name (sort keys %results) {
    say "$name => ", sum(@{$results{$name}}) / @{$results{$name}};
}

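Note that a hash built this way needs memory comparable to the file size (or possibly even more), since all values are stored.

This was tested with the two lines of sample data shown, repeated and varied in a file. The code does not validate the input in any way; it expects the second field to always be a number.

Note that there is no reason to leave the program and use external commands for this.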

Using grep -w to prevent "foo" from matching "foo-bar"
