Question

I have a large word-list file with one word per line. I want to filter out the words that contain repeated letters.

INPUT:
  abducts
  abe
  abeam
  abel
  abele

OUTPUT:
  abducts
  abe
  abel

I want to do this with a regex (grep, Perl, or Python). Is that possible?

Answer 1

It is much easier to write a regex that matches words that do have repeated letters, and then negate the match:

my @input = qw(abducts abe abeam abel abele);
my @output = grep { not /(\w).*\1/ } @input;

(This code assumes that each entry in @input contains exactly one word.) But this problem isn't necessarily best solved with a regex.

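I've given the code in Perl, but it can easily be translated to any regex flavor that supports backreferences, including grep (which also has the -v switch to negate the match).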

Answer 2

$ egrep -vi '(.).*\1' wordlist

Answer 3

This can be done with a regular expression:

import re

inp = [
    'abducts'
,   'abe'
,   'abeam'
,   'abel'
,   'abele'
]

# detect word which contains a character at least twice
rgx = re.compile(r'.*(.).*\1.*') 

def filter_words(inp):
    for word in inp:
        if rgx.match(word) is None:
            yield word

print(list(filter_words(inp)))

Answer 4

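Simple Stuff

Despite the inaccurate protestation that this is impossible with a regex, it certainly is possible.

While @cjm justly states that it is a lot easier to negate a positive match than to express a negative one as a single pattern, the model for doing so is sufficiently well known that it becomes a mere matter of plugging things into that model. Given that: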

/X/

matches something, then the way to express the condition

    ! /X/

in a single, positively matching pattern is to write it as

    /\A (?: (?! X ) . ) * \z /sx

Therefore, given that the positive pattern is

    / (\pL) .* \1 /sxi

the corresponding negative must be

    /\A (?: (?! (\pL) .* \1  ) . ) * \z /sxi

by simple substitution for X.
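As a rough illustration of that substitution, here is a minimal Python sketch. It is not the Perl from this answer: Python's re module has no \pL, so [^\W\d_] stands in as an approximation of "letter", \Z plays the role of \z, and the names NO_REPEAT and no_repeated_letters are made up for the example.

```python
import re

# Negated form of X = ([^\W\d_]) .* \1, written as a single positive pattern.
NO_REPEAT = re.compile(
    r"""
    \A
    (?:
        (?! ([^\W\d_]) .* \1 )   # X must not match starting at this position
        .                        # ...then consume one character
    )*
    \Z
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

def no_repeated_letters(words):
    """Keep only the words that the single positive pattern accepts."""
    return [w for w in words if NO_REPEAT.match(w)]

print(no_repeated_letters(["abducts", "abe", "abeam", "abel", "abele"]))
# ['abducts', 'abe', 'abel']
```

The lookahead is re-tested at every position, which is why this form tends to be slower and harder to read than simply negating a positive match.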

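Real-World Concerns

That said, there are extenuating concerns that may sometimes require more work. For example, while \pL describes any code point with the GeneralCategory=Letter property, it does not say what to do with words like red-violet-coloured, 'Tis not, or fiancée, the last of which differs between its otherwise equivalent NFD and NFC forms.

You therefore must first run the string through full decomposition, so that a string like "r\x{E9}sume\x{301}" correctly reveals the duplicate "letter é", that is, all canonically equivalent grapheme cluster units.

To account for cases like these, you must at a bare minimum first run your string through an NFD decomposition, and then consume grapheme clusters via \X instead of arbitrary code points via the dot (.).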
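The effect of decomposing first and then comparing whole clusters can be sketched in Python. This is only an approximation under stated assumptions: the standard re module has no \X, so a "cluster" here is simply a base character plus any following combining marks (characters with a non-zero canonical combining class), which is not full grapheme segmentation. The function names are hypothetical.

```python
import unicodedata

def clusters(text):
    """Split NFD-normalized text into base characters plus trailing combining marks."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if out and unicodedata.combining(ch):
            out[-1] += ch          # attach the mark to the preceding base character
        else:
            out.append(ch)
    return out

def has_repeated_cluster(text):
    """True if the same (case-folded) cluster occurs more than once."""
    units = [c.casefold() for c in clusters(text)]
    return len(set(units)) != len(units)

# One precomposed é (U+00E9) and one e + combining acute (U+0301):
print(has_repeated_cluster("r\u00e9sume\u0301"))   # True
print(has_repeated_cluster("fianc\u00e9e"))        # False: é and e are different clusters
```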

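So for English, you would want something along these lines for the positive match, with the corresponding negative match obtained by the substitution given above: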

    NFD($string) =~ m{
        (?<ELEMENT>
           (?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X 
        )
        \X *
        \k<ELEMENT>
    }xi

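But even then, certain issues remain unresolved, such as whether \N{EN DASH} and \N{HYPHEN} should be considered equivalent elements or different ones.

That is because, properly written, hyphenating two elements like red-violet and colored to form the single compound word red-violet–colored, where at least one of the pair already contains a hyphen, requires an EN DASH as the separator instead of a mere HYPHEN.

Normally the EN DASH is reserved for compounds of like nature, such as a time–space trade-off. People using typewriter English don't even do that, though, and use that super-massively overloaded legacy code point, HYPHEN-MINUS, for both: red-violet-colored.

It just depends on whether your text came from some 19th-century manual typewriter, or whether it represents English text properly rendered under modern typesetting rules. :)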

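Conscientious Case Insensitivity

You will notice that I am treating letters that differ only in case as the same letter. That is because I use the /i regex switch, ᴀᴋᴀ the (?i) pattern modifier.

That is rather like saying that they are the same at collation strength 1, but not quite, because Perl uses only case folding (albeit full case folding, not simple) for its case-insensitive matches, not some collation strength higher than the tertiary level, as might be preferred.

Full equivalence at the primary collation strength is a significantly stronger statement, but one that may well be needed to fully solve the problem in the general case. However, it requires a lot more work than the problem calls for in many specific instances. In short, it is overkill for many of the specific cases that actually arise, no matter how much it might be needed for the hypothetical general case.

This is made even more difficult because, although you can for example do this: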

    my $collator = new Unicode::Collate::Locale::
                       level => 1, 
                       locale => "de__phonebook",
                       normalization => undef,
                    ;

    if ($collator->cmp("müß", "MUESS") == 0) { ... }

and expect to get the right answer (and you do, hurray!), this sort of robust string comparison is not easily extended to regex matches.

Yet. :)
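The gap between case folding and primary-strength collation can be seen in a small Python sketch. It relies on Python's str.casefold(), which applies Unicode full case folding; it is not Perl's /i and not a collator, and same_when_folded is a hypothetical helper name.

```python
def same_when_folded(a, b):
    """Compare two strings after Unicode full case folding (no collation involved)."""
    return a.casefold() == b.casefold()

# Full case folding maps ß to "ss", so these compare equal:
print(same_when_folded("Straße", "STRASSE"))   # True

# The phonebook pair from the collator example is NOT equal under folding alone,
# because folding knows nothing about the ü -> ue tailoring:
print(same_when_folded("müß", "MUESS"))        # False
```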

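Summary

The choice of whether to under-engineer or over-engineer a solution will vary with individual circumstances, and no one can make it for you.

I like CJM's solution that negates a positive match, myself, although it is somewhat cavalier about what it considers a duplicate letter. Notice: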

    while ("de__phonebook" =~ /(?=((\w).*?\2))/g) {
        print "The letter <$2> is duplicated in the substring <$1>.\n";
    }

produces

    The letter <e> is duplicated in the substring <e__phone>.
    The letter <_> is duplicated in the substring <__>.
    The letter <o> is duplicated in the substring <onebo>.
    The letter <o> is duplicated in the substring <oo>.

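This shows why, when you need to match a letter, you should always use \pL ᴀᴋᴀ \p{Letter} instead of \w, which actually matches [\p{alpha}\p{GC=Mark}\p{NT=De}\p{GC=Pc}].

Of course, when you need to match an alphabetic, you need to use \p{alpha} ᴀᴋᴀ \p{Alphabetic}, which is not at all the same as a mere letter, contrary to popular misunderstanding. :)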
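The same observation can be reproduced in Python as a rough sketch. Python's re module lacks \pL, so the class [^\W\d_] is used here as an assumed stand-in for "letter"; the helper name duplicates is hypothetical.

```python
import re

def duplicates(text, char_class):
    """Return (char, substring) pairs, mirroring the Perl loop above."""
    pattern = re.compile(r"(?=((%s).*?\2))" % char_class)
    return [(m.group(2), m.group(1)) for m in pattern.finditer(text)]

# \w also matches the underscore, so "__" is reported as a duplicated "letter":
print(duplicates("de__phonebook", r"\w"))
# [('e', 'e__phone'), ('_', '__'), ('o', 'onebo'), ('o', 'oo')]

# Restricting the class to letters drops the underscore hit:
print(duplicates("de__phonebook", r"[^\W\d_]"))
# [('e', 'e__phone'), ('o', 'onebo'), ('o', 'oo')]
```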

Answer 5

If you're dealing with long strings that are likely to contain duplicate letters, stopping as soon as possible may help.

INPUT: for (@input) {
   my %seen;
   while (/(.)/sg) {
      next INPUT if $seen{$1}++;
   }
   say;
}

I'd go with the simplest solution unless its performance turns out to be really unacceptable.

my @output = grep !/(.).*?\1/s, @input;

Answer 6

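I was curious about the relative speed of the various Perl-based methods that other authors submitted for this question, so I decided to benchmark them.

Where necessary, I modified each method slightly so that it populates an @output array, to keep input and output consistent. I verified that all of the methods produce the same @output, although I have not documented that assertion here.

Here is the script that benchmarks the various methods: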

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(cmpthese :hireswallclock);

# get a convenient list of words (on Mac OS X 10.6.6, this contains 234,936 entries)
open (my $fh, '<', '/usr/share/dict/words') or die "can't open words file: $!\n";
my @input = <$fh>;
close $fh;

# remove line breaks
chomp @input;

# set up the tests
my %tests = (

  # Author: cjm
  RegExp => sub { my @output = grep { not /(\w).*\1/ } @input },

  # Author: daotoad
  SplitCount => sub { my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input; },

  # Author: ikegami
  NextIfSeen => sub {
    my @output;
    INPUT: for (@input) {
      my %seen;
      while (/(.)/sg) {
        next INPUT if $seen{$1}++;
      }
      push @output, $_;
    }

  },

  # Author: ysth
  BitMask => sub {
    my @output;
    for my $word (@input) {
      my $mask1 = $word x ( length($word) - 1 );
      my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
      if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        push @output, $word;
      }
    }
  },

);

# run each test 100 times
cmpthese(100, \%tests);

Here are the results for 100 iterations.

           s/iter SplitCount    BitMask NextIfSeen     RegExp
SplitCount   2.85         --       -11%       -58%       -85%
BitMask      2.54        12%         --       -53%       -83%
NextIfSeen   1.20       138%       113%         --       -64%
RegExp      0.427       567%       496%       180%         --

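As you can see, cjm's "RegExp" method is the fastest by far. It is 180% faster than the next fastest method, ikegami's "NextIfSeen" method. I suspect that the relative speeds of the RegExp and NextIfSeen methods will converge as the average length of the input strings increases. But for English words of "normal" length, the RegExp method is the fastest.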

Answer 7

cjm gave the regex, but here's an interesting non-regex way:

@words = qw/abducts abe abeam abel abele/;
for my $word (@words) {
    my $mask1 = $word x ( length($word) - 1 );
    my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
    if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        print "$word\n";
    }
}
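The Perl above XORs the word, repeated length-1 times, against all of its rotations laid end to end; a NUL byte in the result means some character lined up with an equal character from another position. A minimal Python sketch of the same idea follows. It compares one rotation at a time instead of building the two long mask strings, and the function name is hypothetical.

```python
def has_repeat_by_rotation(word):
    """True if some character of word also occurs at another position."""
    for shift in range(1, len(word)):
        rotated = word[shift:] + word[:shift]
        # The XOR of two code points is 0 exactly when the characters are equal.
        if any((ord(a) ^ ord(b)) == 0 for a, b in zip(word, rotated)):
            return True
    return False

words = ["abducts", "abe", "abeam", "abel", "abele"]
print([w for w in words if not has_repeat_by_rotation(w)])
# ['abducts', 'abe', 'abel']
```

Every pair of positions is compared, so the work grows quadratically with word length, which is consistent with this approach ranking behind the regex in the benchmark above.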

Answer 8

In response to cjm's solution, I wondered how it compares with some rather terse Perl:

my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input;

Since I'm not constrained in character count or formatting here, I'll be a bit clearer, even to the point of over-documenting:

my @output = grep {

    # Split $_ on the empty string to get letters in $_. 
    my @letters = split '';

    # Use a hash to remove duplicate letters.
    my %unique_letters;
    @unique_letters{@letters} = ();  # This is a hash slice assignment.
                                     # See perldoc perldata for more info

    # is the number of unique letters equal to the number of letters?
    keys %unique_letters == @letters

} @input;

Of course, in production code, please do something like this:

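my @output = grep ! has_repeated_letters($_), @input;

sub has_repeated_letters {
    my $word = shift;

    # Same approach as the example above: collect the letters in a hash
    # and compare the number of unique letters with the total.
    my @letters = split '', $word;
    my %unique_letters;
    @unique_letters{@letters} = ();

    return scalar(keys %unique_letters) != scalar(@letters);
}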

Answer 9

In Python with a regex:

python -c 'import re, sys; print "".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s))' wordlist.txt

In Python without a regex:

python -c 'import sys; print "".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s)))' wordlist.txt

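I ran some timing tests with a hard-coded file name and with the output redirected to /dev/null, so that output is not included in the timing:

Timing without the regex: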

python -m timeit 'import sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if len(s) == len(frozenset(s)))' 2>/dev/null
10000 loops, best of 3: 91.3 usec per loop

Timing with the regex:

python -m timeit 'import re, sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if re.match(r".*(\w).*\1", s))' 2>/dev/null
10000 loops, best of 3: 105 usec per loop

Evidently, the regex is a bit slower than a simple frozenset creation and len comparison in Python.
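The one-liners above use the Python 2 print statement. As a hedged sketch, the same two filters could be written as Python 3 functions like this; the function names and the in-memory sample list are assumptions for illustration, and no timings are implied.

```python
import re

REPEAT = re.compile(r".*(\w).*\1")

def unique_by_regex(lines):
    """Keep lines in which no word character occurs twice."""
    return [s for s in lines if not REPEAT.match(s)]

def unique_by_set(lines):
    """Keep lines whose characters (trailing newline included) are all distinct."""
    return [s for s in lines if len(s) == len(frozenset(s))]

# Lines as they would come from iterating over an open word-list file.
lines = ["abducts\n", "abe\n", "abeam\n", "abel\n", "abele\n"]
print("".join(unique_by_set(lines)), end="")
```

Note that the set-based version counts every character, not just word characters, so the two functions can disagree on lines that contain repeated punctuation or whitespace.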

Answer 10

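You can't do this with a regex. A regex is a finite state machine, and this would require a stack to store the letters that have already been seen.

I would suggest doing this with a foreach and checking each word manually in code. Something like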

List chars
foreach word in list
    foreach letter in word
        if chars.contains letter then remove word from list
        else
            chars.Add letter
    chars.clear
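The pseudocode above could be sketched in Python roughly as follows. This version builds a new list rather than removing words while iterating, and it uses a fresh set for each word; both choices, and the function names, are assumptions of this sketch and not part of the answer.

```python
def has_repeated_letter(word):
    """Return True as soon as a letter is seen for the second time."""
    seen = set()
    for letter in word:
        if letter in seen:
            return True
        seen.add(letter)
    return False

def filter_words(words):
    """Keep only the words in which every letter is distinct."""
    return [w for w in words if not has_repeated_letter(w)]

print(filter_words(["abducts", "abe", "abeam", "abel", "abele"]))
# ['abducts', 'abe', 'abel']
```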

Extract words that contain no repeated letters from a list using only a regex
