Regular expression to match repeated strings at the start of a line and remove the duplicates

Time: 2014-09-29 22:38:27

Tags: regex perl sublimetext2

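I'm new to regular expressions, and I'm fairly sure this has been answered somewhere, but I haven't managed to adapt what I found. I'm working with a dictionary file that contains duplicate headwords, which makes the compiler fail. So I need to match exact headwords at the start of a line (none of these words contain characters such as "[" or "<") and remove the duplicates. There are many, many duplicated headwords in the file, though, so I'd like to automate the match and replace. Here is an example from the dictionary: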

aGga
<© aGga @>
[m1]aṅgá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim aṅga,[/b] how much more?[/trn][/m]

aGga
<© aGga @>
[m1]áṅga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

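Here I need to match the identical headwords ("aGga"), then delete the second, third, etc. instances (the second "aGga") together with the line that follows each of them (which always happens to be enclosed in < and >, here "<© aGga @>"), producing the desired output: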

aGga
<© aGga @>
[m1]aṅgá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim aṅga,[/b] how much more?[/trn][/m]
[m1]áṅga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

I have seen as many as 3 instances of a headword, so I need to look for more than a single repeat of any given headword.

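My attempts so far (for example "^(.+?\s)", based on this question) just return far too much instead of matching identical headwords. I mostly use the regex find-and-replace function in Sublime Text, but I'm happy to do this any way possible. I know this is probably very simple and boring for regex experts, so thank you for taking the time to help a newbie.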

2 answers:

Answer 0 (score: 4)

One way with Perl:

my $data = 'aGga
<© aGga @>
[m1]aṅgá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim aṅga,[/b] how much more?

aGga
<© aGga @>
[m1]áṅga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i]

aGga
<© aGga @>
[m2][trn][i]m. pl. No of a people and their country.[/i]

gubo
<© gubo @>
kjhkjhkj hkjhk jhk kjhkjh khk hkjh kj hkj';
$data =~ s/^
(?|
    \G(?!\A) ([^[<\s]+) \R <©\ \1\ @>  # contiguous
  |
    ([^[<\s]+) \R <©\ \1\ @> \K        # new item
)
( (?>\R.+)* )      # block: group 2
(?: \R\R (?= \1 \R <©[^>]+@> $ ) )?
/$2/gmx;
print $data;

Answer 1 (score: 3)

Edit: some utf8 open/close stuff

# Open a temp file for writing as utf8
# Output to this file will be automatically encoded from Perl internal to utf8 octets
# Write the internal string
# Check the file with a utf8 editor
# ---------------------------------------------- 
open (my $out, '>:utf8', 'temp.txt') or die "can't open temp.txt for writing $!";
print $out $internal_string_1;
close $out;


# Open the temp file for reading as utf8
# All input from this file will be automatically decoded as utf8 octets to Perl internal
# Read/decode to a different internal string
# ----------------------------------------------
open (my $in, '<:utf8', 'temp.txt') or die "can't open temp.txt for reading $!";
$/ = undef;
my $internal_string_2 = <$in>;
close $in;

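Sorry this is so long.
This is one way; it uses a global replace with a callback. For this to work, the blocks have to be sequential.

If the blocks are not sequential, the solution would have to be extended.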

 # /((?<=^)\s*)^([^<\[\n]+?)(\s*\n\s*<.*>.*(?:\n|$))/

 (                             # (1 start), Ws trim
      (?<= ^ )
      \s* 
 )                             # (1 end)
 ^                             # BOL
 ( [^<\[\n]+? )                # (2), Head
 (                             # (3 start), Angle head
      \s* \n \s* < .* > .* 
      (?: \n | $ )                  # Newline or EOL
 )                             # (3 end)
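
To see what the three groups in the expanded pattern above actually capture, here is a minimal sketch that runs the same pattern once against a small made-up sample. The sample text and the variable names are assumptions chosen for illustration (the names mirror the group comments); it is not part of the original solution.

```perl
use strict;
use warnings;
use utf8;                              # the sample contains a literal ©
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample: a blank separator line followed by one entry.
my $sample = "\n" . join("\n", 'aGga', '<© aGga @>', '[m1]body[/m]') . "\n";

# Same pattern as above on one line; /m lets ^ and $ match at line breaks.
# A match in list context returns the three captured groups.
my ($wstrim, $head, $angle_head) =
    $sample =~ /((?<=^)\s*)^([^<\[\n]+?)(\s*\n\s*<.*>.*(?:\n|$))/m;

# Print each group with newlines shown as a literal \n,
# so the group boundaries are visible.
for my $pair ( [ wstrim => $wstrim ], [ head => $head ], [ angle_head => $angle_head ] ) {
    my ($name, $text) = @$pair;
    $text =~ s/\n/\\n/g;
    printf "%-10s = '%s'\n", $name, $text;
}
```

Group 1 picks up the whitespace between blocks, group 2 the headword, and group 3 the angle-bracket line including its surrounding newlines. Dropping all three for a repeated headword is what removes the blank separator line along with the duplicate heading.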

Perl sample:

use strict;
use warnings;

$/ = undef;
#my $filehandle = open(..);
#my $data = <$filehandle>;

my $data = <DATA>;


my $lasthead = "";


sub StripDupHead
{
   my ($wstrim, $head, $angle_head ) = @_;
   if ( $head eq $lasthead ) {
      return "";
   }
   $lasthead = $head;
   return $wstrim . $head . $angle_head;
}

$data =~ s/((?<=^)\s*)^([^<\[\r\n]+?)(\s*\r?\n\s*<.*>.*(?:\r?\n|$))/StripDupHead($1,$2,$3)/emg;

print $data, "\n";
# print $filehandle $data, "\n";
# close ($filehandle);

__DATA__

aGga
<© aGga @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]

aGga
<© aGga @>
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

bGgb
<© bGgb @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]

cGgc
<© cGgc @>
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

cGgc
<© cGgc @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]

cGgc
<© cGgc @>
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

Output:

aGga
<© aGga @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]

bGgb
<© bGgb @>
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]

cGgc
<© cGgc @>
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]
[m1]a?gá [/m]
[m2][trn][i]pel. ¤1.¤ emphatic[/i]: just, only; especially; ¤2,¤ [i]exhortative[/i]: [i]w. voc. or impv.[/i]; ¤3.¤ [i]intr.[/i]: [/trn][/m]
[m2][trn][b]kim a?ga,[/b] how much more?[/trn][/m]
[m1]á?ga [/m]
[m2][trn][i]m. pl. No of a people and their country.[/i][/trn][/m]
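
The answer notes that non-sequential blocks would need an extended solution. One possible extension, sketched here under stated assumptions, drops the regex callback and groups entries by headword in a hash instead. The function name `merge_entries` and the sample data are hypothetical. The sketch assumes LF line endings, blank lines between entries, and that every entry consists of a headword line, an angle-bracket line, and then body lines.

```perl
use strict;
use warnings;
use utf8;                              # the sample contains a literal ©
binmode STDOUT, ':encoding(UTF-8)';

# Merge entries that share a headword, even when they are not adjacent.
# Headwords keep the order in which they first appear.
sub merge_entries {
    my ($text) = @_;
    my (@order, %entry);
    for my $block ( grep { /\S/ } split /\n\s*\n/, $text ) {
        $block =~ s/\A\s+//;                       # drop leading blank lines
        my ($head, $angle, @body) = split /\n/, $block;
        if ( !exists $entry{$head} ) {             # first time this headword is seen
            push @order, $head;
            $entry{$head} = { angle => $angle, body => [] };
        }
        push @{ $entry{$head}{body} }, @body;      # later bodies are appended
    }
    return join "\n\n",
        map { join "\n", $_, $entry{$_}{angle}, @{ $entry{$_}{body} } } @order;
}

# Hypothetical input: the two aGga entries are separated by a bGgb entry.
my $input = join "\n",
    'aGga', '<© aGga @>', '[m1]first[/m]',  '',
    'bGgb', '<© bGgb @>', '[m1]other[/m]',  '',
    'aGga', '<© aGga @>', '[m1]second[/m]', '';

my $merged = merge_entries($input);
print $merged, "\n";
```

Unlike the callback version, this reorders the file: the body of a later duplicate moves up to the first occurrence of its headword. That is usually what a dictionary compiler wants, but it is a design choice rather than something the question specifies.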