Question

I am trying to write a Perl script to process some 3+ GB text files with the following structure:

1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

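I want to perform two operations:

Count the number of delimiters on each line and compare it to a static number (i.e. 5); lines that exceed that number should be output to file.control.
Remove duplicates from the file by substr($line, 0, 7), i.e. the first 7 digits, while preserving the original order. I want that output in file.output.

I wrote this as a simple shell script (just bash), but it took too long to process. The same script calling Perl one-liners was faster, but I am interested in doing this purely in Perl.

The code I have so far is: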

open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;

while ($line_in = <$file_hndl_in>)
{
    # Calculate n. of delimiters
    my $delim_cur_line = $line_in =~ y/"$delimiter"//;
    # print "$commas \n" 

   if ( $delim_cur_line != $delim_amnt_per_line )
   {
      print {$file_hndl_ot_control} "$line_in";  
   }

   # Remove duplicates by substr(0,7) maintain order
   my substr_in = substr $line_in, 0, 11;
   print if not $lines{$substr_in}++;

}

I would like the file.output file to look like

1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

and the file.control file to look like:

(assuming the delimiter control number is 6)

4352342xx23232xxx345545x45454x23232xxx

Can anyone help me? Thanks.

Edit: trial code

my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;



open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");

while ( <> ) {

    my $count = ($_ =~ y/x//);
    print  "$count \n";
    # print $_;

    if ( $count != $delim_amnt_per_line )
    {
        print fh1 $_;
    }


    my ($prefix) = substr $_, 0, 7;
    next if $seen{$prefix}++;

    print fh2;
}

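I don't know whether I should be posting the new code here, but I tried the above based on your example. What baffles me (I am still very new to Perl) is that it does not output to either file handle, yet if I redirect from the command line as you said, it works perfectly. The problem is that I need to output to 2 different files.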

Answer 1

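It looks like entries with the same seven-character prefix may appear anywhere in the file, so a hash is needed to keep track of the prefixes already encountered. With a 3 GB text file this may cause the perl process to run out of memory, in which case a different approach would be needed. Please give it a try and see whether it stays within the limit.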

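The tr/// operator (the same as y///) does not accept variables for its character list, so I have used eval to create a subroutine delimiters() that counts the number of occurrences of $delimiter in $_.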
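To illustrate the point about tr/// and variables, here is a minimal sketch of two ways to count a delimiter held in a variable. The names $count_tr and count_regex are hypothetical and are not part of the answer's program; the first option mirrors the eval idea, the second swaps in a regex match in list context, which needs no eval.

```perl
use strict;
use warnings;

my $delimiter = 'x';

# Option 1: tr/// does not interpolate variables, so build the operator
# at runtime with a string eval. \Q...\E escapes the delimiter.
my $count_tr = eval "sub { my \$s = shift; return \$s =~ tr/\Q$delimiter\E//; }"
    or die "could not compile counter: $@";

# Option 2 (an alternative, not the answer's method): count regex matches
# by assigning the list of matches to an empty list in scalar context.
sub count_regex {
    my ($s) = @_;
    my $n = () = $s =~ /\Q$delimiter\E/g;
    return $n;
}

my $line = "1212123x534534534534xx4545454x232322xx\n";
printf "tr: %d, regex: %d\n", $count_tr->($line), count_regex($line);
# prints "tr: 6, regex: 6"
```

The tr/// form is generally the faster of the two for single characters, which may matter on multi-gigabyte input; the regex form also works for multi-character delimiters.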

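It is usually easiest to pass the input file as a parameter on the command line and redirect the output as necessary. That way you can run the program on different files without editing the source, and that is how I have written this program. You should run it as

$ perl filter.pl my_input.file > my_output.file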

use strict;
use warnings 'all';

my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;

eval "sub delimiters { tr/$delimiter// }";

while ( <> ) {
    next if delimiters() == $delim_amnt_per_line;

    my ($prefix) = substr $_, 0, 7;
    next if $seen{$prefix}++;

    print;
}
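For the follow-up about writing to two different files, the following is a sketch only, under stated assumptions. In the trial code, print fh1 $_ names a bareword handle fh1 rather than the lexical $fh1 that was opened, which is why nothing reaches the files. The helper name split_lines and the output file names are hypothetical (the names come from the trial code). The expected count of 6 follows the question's sample output, and the != comparison follows the question's code rather than a strict "exceeds" test.

```perl
use strict;
use warnings;

# Hypothetical helper: send lines with an unexpected delimiter count to
# $control, and the first line seen for each 7-character prefix to $output.
sub split_lines {
    my ($in, $control, $output, $delimiter, $expected) = @_;
    my %seen;
    while ( my $line = <$in> ) {
        my $count = () = $line =~ /\Q$delimiter\E/g;

        # Note the braces and the sigil: print {$fh} ..., not print fh ...
        print {$control} $line if $count != $expected;

        my $prefix = substr $line, 0, 7;
        print {$output} $line unless $seen{$prefix}++;
    }
    return;
}

# Assumed usage: perl filter.pl my_input.file
if (@ARGV) {
    open my $in,      '<', $ARGV[0]            or die "Cannot open $ARGV[0]: $!";
    open my $control, '>', 'outputcontrol.txt' or die "Cannot open control file: $!";
    open my $output,  '>', 'outputoutput.txt'  or die "Cannot open output file: $!";
    split_lines($in, $control, $output, 'x', 6);
    close $_ for $in, $control, $output;
}
```

The two checks are independent here, so a line can appear in both files, as the last sample line does in the question's expected output. The %seen hash still holds one key per distinct prefix, so the memory concern raised above applies unchanged.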

Remove duplicate lines from a file by substring, preserving order (Perl)
