Question

Perl regex to match "words" in the following filenames?

I have a series of filenames in which some words appear more than once:

john_smith_on_alaska_trip_john_smith_0001.jpg

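His wife's name is Olga, with an umlaut over the o, and there are a few other diacritics; in my case everything is lowercase, but not limited to English a-z. For other reasons the .jpg is stripped for now, so it can be ignored in this discussion.

I want to remove the duplicate names/words. Something like this works well in emacs: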

s/(\b\w{3,}\b)(.*)(\b\1\b)/\1\2/

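Run once, the above turns the name into: john_smith_on_alaska_trip__smith_0001.jpg

Run again: john_smith_on_alaska_trip___0001.jpg

In Perl this does not work, because \w includes _ as a word character. Worse, the \b anchor is defined in terms of those same word characters, so it does not treat _ as a word break.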
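To make the problem concrete, here is a small check of how \w, \b and \p{Alpha} treat the underscore. It is only an illustrative sketch; the variable names are mine and the filename is the one from the question with .jpg stripped.

```perl
use strict;
use warnings;

my $name = "john_smith_on_alaska_trip_john_smith_0001";

# \w matches letters, digits AND '_', so the whole name is a single "word".
my @w_words = $name =~ /\w+/g;
print scalar(@w_words), "\n";                    # prints 1

# \b only occurs between a \w character and a non-\w character,
# so there is no boundary between "john" and the "_" that follows it.
print "no \\b after john\n" unless $name =~ /john\b/;

# \p{Alpha} excludes '_' and digits, so it sees the individual words.
my @alpha_words = $name =~ /\p{Alpha}+/g;
print "@alpha_words\n";                          # john smith on alaska trip john smith
```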

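My current workaround is to swap every _ for a different character, do the contraction, and then change it back. But this seems like such a basic requirement that I feel I must be missing something.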
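For reference, the workaround might look roughly like this. This is only a sketch: the question does not say which stand-in character is used, so the space below is an assumption, and dedupe_via_swap is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical helper; using a space as the stand-in character is an assumption.
sub dedupe_via_swap {
    my ($name) = @_;
    $name =~ tr/_/ /;                                    # turn '_' into a non-word character
    1 while $name =~ s/(\b\w{3,}\b)(.*)(\b\1\b)/$1$2/;   # the substitution from the question, repeated
    $name =~ tr/ /_/;                                    # restore the underscores
    return $name;
}

print dedupe_via_swap("john_smith_on_alaska_trip_john_smith_0001"), "\n";
# prints john_smith_on_alaska_trip___0001
```

Note that the round trip would also turn any spaces already present in a name into underscores, which is one reason it feels like a hack.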

Thanks.

Answer 1

Instead of word boundaries, use the character class \p{Alpha} together with lookbehind and lookahead assertions, so that each match is a whole word and not a substring of a longer one:

use strict;
use warnings;

my $file = "john_smith_on_alaska_trip_john_smith_0001_johnsmith.jpg";

1 while $file =~ s{
    (?<!\p{Alpha}) ( \p{Alpha}++ )     # Word surrounded by non-word chars
    .* \K                              # Keep everything before this point
    (?<!\p{Alpha}) \1 (?!\p{Alpha})    # Strip duplicate word 
}{}x;

print "$file\n";

Output:

john_smith_on_alaska_trip___0001_johnsmith.jpg
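Since the question mentions names with umlauts and other diacritics, here is a sketch of the same substitution applied to a non-ASCII name. The filename and the helper name strip_repeated_words are made up for illustration. It assumes the string holds decoded characters, which `use utf8` arranges for literals in the source; names read from the filesystem usually arrive as bytes and would probably need decoding first (for example with the Encode module).

```perl
use strict;
use warnings;
use utf8;                               # string literals in this file are UTF-8 text
binmode STDOUT, ':encoding(UTF-8)';     # encode characters when printing

# Hypothetical helper wrapping the substitution shown in this answer.
sub strip_repeated_words {
    my ($name) = @_;
    1 while $name =~ s{
        (?<!\p{Alpha}) ( \p{Alpha}++ )     # a whole run of letters
        .* \K                              # keep everything matched so far
        (?<!\p{Alpha}) \1 (?!\p{Alpha})    # a later whole-word repeat to remove
    }{}x;
    return $name;
}

print strip_repeated_words("ölga_smith_on_alaska_trip_ölga_smith_0001"), "\n";
# prints ölga_smith_on_alaska_trip___0001
```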


Answer 2

You can use split to break the string into its component parts, then use a hash to check for duplicates:

use strict;
use warnings;

my $string = 'john_smith_on_alaska_trip_john_smith_0001.jpg';
my @words = split /_/, $string;

my %count;
foreach my $word (@words) {
    $word = '' if ++$count{$word} > 1;
}

print join('_', @words), "\n";

Output:

john_smith_on_alaska_trip___0001.jpg

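Alternatively, you can use uniq from List::MoreUtils to get the unique words, but this changes your output slightly by eliminating the consecutive underscores after trip: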

use strict;
use warnings;

use List::MoreUtils 'uniq';

my $string = 'john_smith_on_alaska_trip_john_smith_0001.jpg';
my @words = split /_/, $string;

print join('_', uniq @words), "\n";

Output:

john_smith_on_alaska_trip_0001.jpg
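One difference from the regex in the question: \w{3,} there only considers words of three or more characters, while the hash-based loop earlier in this answer blanks every repeated token, short ones included. If that matters, a minimum length can be added. The following is a sketch only; blank_repeats and its $min_len parameter are names I made up, and the sample filename is invented to show the effect.

```perl
use strict;
use warnings;

# Hypothetical helper: blank repeated tokens, but leave short tokens alone.
sub blank_repeats {
    my ($name, $min_len) = @_;
    my %seen;
    my @words = split /_/, $name, -1;        # limit -1 keeps trailing empty fields
    for my $word (@words) {
        next if length($word) < $min_len;    # short tokens may repeat freely
        $word = '' if $seen{$word}++;        # blank every occurrence after the first
    }
    return join '_', @words;
}

print blank_repeats("john_smith_on_a_trip_on_a_boat_john_0001", 3), "\n";
# prints john_smith_on_a_trip_on_a_boat__0001
```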
