Question

I have a simple Perl script that performs a number of text substitutions, along these lines:

#!/usr/bin/perl
{
open(my $in, "<", "Texts.txt") or die "No input: $!";
open(my $out, ">",  "TeXed/Texts.tex") or die "No output directory: $!";
LINE: while (<$in>) {
    s/(txt@)(.*)(?<!\t|\[)\[(.*)/$1$2\\ovl{}$3/g;
    # there are a bunch of other replacements like the above
    print $out $_ ; 
    }
}

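So far so good. The text I run this script on is organized in blocks (not always of the same length). Each block starts with the same identifier (txt@), followed by a unique label. Every label starts with #. What I want to achieve is to remove all repeated labels: basically, I want to keep only the first instance of each label and replace/remove all subsequent ones until the label changes. In the example below, the parts to be replaced/removed are in bold.

txt@#Label1 some text
more text
more text

txt@**#Label1** some other text
more text
more text
more text

txt@**#Label1** some random text
more text
more text

txt@#Label2 some text
more text
more text
more text

txt@#Label1 some text
more text
more text

txt@#Label3 some text
more text
more text

txt@**#Label3** some text
more text
more text

txt@#Label1 some text
more text
more text

etc.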

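Sorry for the long example; I couldn't think of a better way to explain it.

So I want to remove all the repeated Label1, Label2, etc., without modifying the rest of the text on the same line (some text, some other text) or the subsequent lines. The number of subsequent lines is not always the same (so it is not a matter of replacing every n-th line).

Is this possible in Perl? Or in any other way? (I'm not married to Perl; if it is easier in another language, I would gladly try it. I'm not a programmer, though, so detailed instructions would be highly appreciated.)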

Answer 1

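Introduce a 'current label', the most recent label seen, and keep track of it. When a line with a label comes along, compare: if the label is the same as the current one, it is a repeat, so delete it; otherwise keep it and make it the new 'current' label.

Processing goes line by line. Alternatively, whole blocks can be read at a time to enable per-block processing, which may be more convenient. That code is shown at the end.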

use warnings;
use strict;

open my $fh_out, '>', 'new_text_label.txt' or die "Can't open output file: $!";
open my $fh_in, '<', 'text_label.txt' or die "Can't open input file: $!";

# Our current (running) label
my $curr_label = '';

while (<$fh_in>)
{
    # If the line has a label, fetch it; otherwise print the line and move on
    my ($label) = $_ =~ m/txt@#(\w+)/;
    if (not defined $label) {
        # ... process non-label line as needed ...
        print $fh_out $_;
        next;
    }
    # Delete the label if repeated (matches the current one), reset if new
    if ($curr_label eq $label) {
        s/(txt@)(?:#\w+)(.*)/$1$2/;
    }
    else {
        $curr_label = $label;
    }
    # ... process label-line as needed ...
    print $fh_out $_;
}

close $fh_out or die "Can't close output file: $!";

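This produces the desired file. Lines with and without a label are handled separately, which is useful if further processing differs between them. Alternatively, the label pre-processing can be done in one place, which is better if further processing does not distinguish between lines with and without a label.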

while (<$fh_in>)
{
    # If this is a label line, process it: delete or record the label
    if (my ($label) = $_ =~ m/txt@#(\w+)/) {
        # Delete if repeated (matching the current), reset if new
        if ($curr_label eq $label) {
            s/(txt@)(?:#\w+)(.*)/$1$2/;
        }
        else {
            $curr_label = $label;
        }
    }
    # The label is now fixed as needed. Process lines normally ...
    print $fh_out $_;
}

This replaces the while loop above; the rest of the code is the same.
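
As a quick check of the logic, the label handling can be wrapped in a small subroutine and run on a few in-memory lines instead of files. This is only a sketch: the name `dedupe_labels` and the sample lines are made up for illustration, and it assumes labels have the form `txt@#Label` as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code above): takes a list of lines and
# returns a copy in which a label repeating the most recent one is removed.
sub dedupe_labels {
    my @lines = @_;
    my $curr_label = '';
    for my $line (@lines) {
        # Only lines that carry a label are touched
        if (my ($label) = $line =~ m/txt\@#(\w+)/) {
            if ($label eq $curr_label) {
                # Repeat: drop '#Label', keep 'txt@' and the rest of the line
                $line =~ s/(txt\@)#\w+/$1/;
            }
            else {
                # New label: remember it as the current one
                $curr_label = $label;
            }
        }
    }
    return @lines;
}

# Made-up sample input, modelled on the example in the question
my @sample = (
    "txt\@#Label1 some text\n",
    "more text\n",
    "\n",
    "txt\@#Label1 some other text\n",
    "more text\n",
    "\n",
    "txt\@#Label2 some text\n",
    "\n",
    "txt\@#Label1 some text\n",
);

print dedupe_labels(@sample);
```

With these sample lines only the second `#Label1` is removed; the last one is kept because the label changed to `Label2` in between.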

Addition to the original post, following comments

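Here are the changes to the code so that whole blocks are read at a time, which is beneficial for processing that can make use of having the entire block of text in a variable. Note that a block contains newlines (so regexes may need /s, etc.). To allow for possible batch processing, all blocks are first read into an array.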

my @blocks = do {
    # Set the record separator to a blank line to read whole blocks
    local $/ = "\n\n";
    open my $fh_in, '<', 'text_label.txt' or die "Can't open input file: $!";
    <$fh_in>;
};

# Our current (running) label
my $curr_label = '';

foreach my $bl (@blocks)
{
    # The label pre-processing is exactly the same as above, applied to $bl
    # Other processing can now utilize having the whole block in $bl
}
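
The loop body over blocks could be filled in along the following lines. This is a sketch under stated assumptions, not a definitive version: it reads from an in-memory string rather than `text_label.txt` so that it is self-contained, the name `dedupe_blocks` and the sample text are made up, and it uses paragraph mode (`$/ = ''`) instead of `"\n\n"`, so that a run of several empty lines also counts as a single separator.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into blank-line-separated blocks and remove
# a block's label when it repeats the label of the previous labelled block.
sub dedupe_blocks {
    my ($text) = @_;

    my @blocks = do {
        # Paragraph mode: one or more empty lines end a record
        local $/ = '';
        # Read from the string through an in-memory filehandle
        open my $fh, '<', \$text or die "Can't read from string: $!";
        <$fh>;
    };

    my $curr_label = '';
    foreach my $bl (@blocks) {
        # Same label logic as in the line-by-line version, applied to a block
        if (my ($label) = $bl =~ m/txt\@#(\w+)/) {
            if ($label eq $curr_label) {
                # Repeat: drop '#Label' from the block, keep everything else
                $bl =~ s/(txt\@)#\w+/$1/;
            }
            else {
                $curr_label = $label;
            }
        }
        # ... other per-block processing on $bl would go here ...
    }
    return @blocks;
}

# Made-up sample text with three blocks
my $text = "txt\@#Label1 some text\nmore text\n\n"
         . "txt\@#Label1 some other text\nmore text\n\n"
         . "txt\@#Label2 some text\nmore text\n";

print dedupe_blocks($text);
```

Each element of the returned list is one whole block, including its newlines, so any later regex that must match across lines within a block would need `/s` or `/m` as appropriate.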
