Question

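I'm trying to learn regular expressions. As an exercise, I'm trying to find every word that appears exactly once in my document; in linguistics this is a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon).

So I thought the following expression would give me the desired result: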

\w{1}

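But this doesn't work. \w returns a single character, not a whole word. It also doesn't seem to give me only the characters that appear once (it actually returns 25873 matches, which I assume are all the alphanumeric characters). Can someone give me an example of how to find a hapax legomenon with a regular expression?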

Answer 1

If you're trying to do this as a learning exercise, you picked a very tricky problem :)

First of all, here is the solution:

\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)

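Now, here is the explanation:

We want to match a word. This is \b\w+\b: a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done with a negative lookbehind, (?<! ... ). A negative lookbehind fails to match if its contents match the string up to this point. So we want to not match if the word we have just matched has already appeared. We do this by using a backreference (\1) to the already captured word. The final pattern here is \b\1\b.*\b\1\b: two copies of the current match, separated by any amount of text (.*).
Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this with a negative lookahead, (?! ... ). A negative lookahead fails to match if its contents match at this point in the string. We want to match the current word after any amount of text, so we use (.*\b\1\b).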
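
To see the building blocks in isolation, here is a minimal Python sketch (not part of the original answer). Python's built-in `re` module only accepts fixed-width lookbehind, so the full pattern above cannot be run there as written; the sketch therefore shows only the word match and the lookahead step, which makes it visible why the lookbehind is still needed. The variable names are made up for illustration.

```python
import re

text = "goat goat leopard bird leopard horse"

# \w{1} matches exactly one word character, so it yields letters, not words.
single_chars = re.findall(r"\w{1}", "goat bird")

# \b\w+\b matches whole words.
words = re.findall(r"\b\w+\b", text)

# Lookahead only: keep a word when no copy of it follows later in the string.
# Without the lookbehind, the LAST copy of a repeated word still gets through.
no_later_copy = re.findall(r"\b(\w+)\b(?!.*\b\1\b)", text)

print(single_chars)   # ['g', 'o', 'a', 't', 'b', 'i', 'r', 'd']
print(words)          # ['goat', 'goat', 'leopard', 'bird', 'leopard', 'horse']
print(no_later_copy)  # ['goat', 'bird', 'leopard', 'horse']
```

The last result still contains goat and leopard, because their final occurrences have no later copy. Ruling those out is the job of the negative lookbehind.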

Here is an example (using C#):

using System;
using System.Text.RegularExpressions;

var s = "goat goat leopard bird leopard horse";

foreach (Match m in Regex.Matches(s, @"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
    Console.WriteLine(m.Value);

Output:

bird
horse

Answer 2

If the regex engine supports indefinite repetition inside lookbehind assertions (e.g. .NET), it can be done in a single regex:

Regex regexObj = new Regex(
    @"(       # Match and capture into backreference no. 1:
     \b       # (from the start of the word)
     \p{L}+   # a succession of letters
     \b       # (to the end of a word).
    )         # End of capturing group.
    (?<=      # Now assert that the preceding text contains:
     ^        # (from the start of the string)
     (?:      # (Start of non-capturing group)
      (?!     #  Assert that we can't match...
       \b\1\b #  the word we've just matched.
      )       #  (End of lookahead assertion)
      .       #  Then match any character.
     )*       # Repeat until...
     \1       # we reach the word we've just matched.
    )         # End of lookbehind assertion.
    # We now know that we have just matched the first instance of that word.
    (?=       # Now look ahead to assert that we can match the following:
     (?:      # (Start of non-capturing group)
      (?!     #  Assert that we can't match again...
       \b\1\b #  the word we've just matched.
      )       #  (End of lookahead assertion)
      .       #  Then match any character.
     )*       # Repeat until...
     $        # the end of the string.
    )         # End of lookahead assertion.", 
    RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // matched text: matchResults.Value
    // match start: matchResults.Index
    // match length: matchResults.Length
    matchResults = matchResults.NextMatch();
}

Answer 3

If you're trying to match English words, the best form is:

[a-zA-Z]+

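The problem with \w is that it also includes _ and the digits 0-9.

If you need to include other characters, you can add them after the Z but before the ]. Alternatively, you may need to normalize the input text first.

Now, if you want a count of all words, or just want to see which words don't appear more than once, you can't do that with a single regex. You'll need to invest some time in programming more complex logic. It may well need to be backed by a database or some kind of in-memory structure to keep track of the counts. After you parse and count the whole text, you can search for the words that have a count of 1.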
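
A quick way to see the difference between the two character classes is to run both on the same input. This is a small Python sketch with a made-up sample string:

```python
import re

sample = "snake_case has 2 parts"

# \w also accepts underscores and digits.
with_w = re.findall(r"\w+", sample)

# [a-zA-Z] accepts ASCII letters only, so "_" and "2" act as separators.
letters_only = re.findall(r"[a-zA-Z]+", sample)

print(with_w)        # ['snake_case', 'has', '2', 'parts']
print(letters_only)  # ['snake', 'case', 'has', 'parts']
```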
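
The count-then-filter approach described above could look roughly like this in Python. It is a sketch under a few assumptions: the in-memory structure is a `collections.Counter`, words are runs of ASCII letters, and `hapax_legomena` is a name invented for this example.

```python
import re
from collections import Counter


def hapax_legomena(text):
    """Return the words that occur exactly once, in order of appearance."""
    words = re.findall(r"[a-zA-Z]+", text)  # extract every word
    counts = Counter(words)                 # word -> number of occurrences
    return [w for w in words if counts[w] == 1]


print(hapax_legomena("goat goat leopard bird leopard horse"))  # ['bird', 'horse']
```

The comparison is case-sensitive, so "Goat" and "goat" count as different words; lowercasing the text first is one possible way to normalize it.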

Answer 4

(\w+){1} will match each word. After that you can always count the matches...

Answer 5

A higher-level solution:

Create an array of your matches:

preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);

Let PHP count your array elements:

$tmp_array = array_count_values($matches[1]);

Iterate over the tmp array and check the word count:

foreach ($tmp_array as $word => $count) {
    echo $word . '  ' . $count;
}

Answer 6

Low-level, but it does what you want:

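Split the text on whitespace:

Put your text into an array (split() was removed in PHP 7; preg_split() does the same job):

$array = preg_split('/\s+/', $text);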

Iterate over that array:

foreach ($array as $word) { ... }

Check that each word really is a word:

// skip tokens that contain anything other than letters
if (preg_match('/[^a-zA-Z]/', $word)) continue;

Add the word to a tmp array as a key:

if (!isset($tmp_array[$word])) $tmp_array[$word] = 0;
$tmp_array[$word]++;

After the loop, iterate over the tmp array and check the word count:

foreach ($tmp_array as $word => $count) {
    echo $word . '  ' . $count;
}
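
For comparison, the same split-and-count steps could be sketched in Python as follows, with the final filter for a count of 1 added. The function name `words_seen_once` is invented for this example, and the letters-only check mirrors the one above.

```python
import re


def words_seen_once(text):
    """Split on whitespace, count letters-only tokens, keep those seen once."""
    counts = {}
    for token in re.split(r"\s+", text.strip()):
        # Skip tokens that contain anything other than ASCII letters.
        if not re.fullmatch(r"[a-zA-Z]+", token):
            continue
        counts[token] = counts.get(token, 0) + 1
    return [word for word, n in counts.items() if n == 1]


print(words_seen_once("goat goat leopard bird, leopard horse"))  # ['horse']
```

Note that "bird," is dropped because of the trailing comma. Splitting on whitespace and rejecting tokens is stricter than extracting letter runs with a regex, so punctuation may need to be stripped first.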
