Question

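I'm trying to write a regular expression that counts how many times two words co-occur within a certain proximity of each other in a string (within 5 words of each other), without double-counting words.

For example, if I have the string:

"The man liked his big hat. The hat was very big."

In this case, the regex should see "big hat" in the first sentence and "hat was very big" in the second, returning a total of 2. Note that in the second sentence there are two words between "hat" and "big", and they appear in a different order than in the first sentence, but they still fall within the 5-word window.

If a regex isn't the right way to solve this problem, please let me know what I should try instead.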

Answer 1

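"If a regex isn't the right way to solve this problem, please let me know what I should try instead."

Regular expressions could probably be made to work, but they are not the best approach.

A better approach is to split the input string into a sequence of words (for example with String.split(...)) and then loop over that sequence like this: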

// Split on runs of non-word characters so punctuation ("hat.") does not stick to a word.
String[] words = input.split("\\W+");
int count = 0;
for (int i = 0; i < words.length; i++) {
    if (words[i].equals("big")) {
        for (int j = i + 1; j < words.length && j - i < 5; j++) {
            if (words[j].equals("hat")) {
                count++;
            }
        }
    }
}
// And repeat for "hat" followed by "big".

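You may need to adjust this depending on exactly what you want to count, but that's the general idea.

If you need to do this for many word combinations, it's worth looking for a more efficient solution. But for a one-off or low-volume use case, simplest is best.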
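
As a rough illustration of the "adjust this" point, here is one possible way to handle both word orders in a single pass while not reusing a word in two pairs. The class name ProximityCounter, the method countPairs, and the reading of "within 5 words" as "positions differ by at most 5" are assumptions made for this sketch, not part of the original answer.

```java
import java.util.Locale;

class ProximityCounter {
    // Counts pairs of (a, b) in either order whose word positions differ by
    // at most maxDistance. Each word takes part in at most one pair
    // (one assumed interpretation of "without double-counting").
    static int countPairs(String input, String a, String b, int maxDistance) {
        String first = a.toLowerCase(Locale.ROOT);
        String second = b.toLowerCase(Locale.ROOT);
        // Lower-case and split on runs of non-word characters.
        String[] words = input.toLowerCase(Locale.ROOT).split("\\W+");
        boolean[] used = new boolean[words.length];
        int count = 0;
        for (int i = 0; i < words.length; i++) {
            if (used[i]) {
                continue;
            }
            String partner;
            if (words[i].equals(first)) {
                partner = second;
            } else if (words[i].equals(second)) {
                partner = first;
            } else {
                continue;
            }
            // Look ahead for the nearest unused partner inside the window.
            for (int j = i + 1; j < words.length && j - i <= maxDistance; j++) {
                if (!used[j] && words[j].equals(partner)) {
                    used[i] = true;
                    used[j] = true;
                    count++;
                    break;
                }
            }
        }
        return count;
    }
}
```

With the sentence from the question, countPairs(input, "big", "hat", 5) pairs "big hat" and "hat was very big" and returns 2. The greedy nearest-partner rule is only one choice; a different pairing rule could give different counts on other inputs.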

Answer 2

Somewhat like Stephen C's answer, but using library classes to help with the mechanics.

    String input = "The man liked his big hat. The hat was very big";
    int proximity = 5;

    // split input into words
    String[] words = input.split("[\\W]+");

    // create a Deque of the first <proximity> words
    Deque<String> haystack = new LinkedList<String>(Arrays.asList(Arrays.copyOfRange(words, 0, proximity)));

    // count duplicates in the first <proximity> words
    int count = haystack.size() - new HashSet<String>(haystack).size();
    System.out.println("initial matches: " + count);

    // process the rest of the words
    for (int i = proximity; i < words.length; i++) {
        String word = words[i];
        System.out.println("matching '" + word + "' in [" + haystack + "]");

        if (haystack.contains(word)) {
            System.out.println("matched word " + word + " at index " + i);
            count++;
        }

        // remove the first word
        haystack.removeFirst();
        // add the current word
        haystack.addLast(word);
    }

    System.out.println("total matches:" + count);

Answer 3

Gee... all that code in the other answers... how about this one-line solution:

int count = input.split("big( \\b.*?){1,5}hat").length + input.split("hat( \\b.*?){1,5}big").length - 2;
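
Two caveats seem worth noting about the one-liner above: String.split drops trailing empty strings, so a match at the very end of the input is not counted, and `.*?` can itself span spaces, so `{1,5}` does not really cap the number of words. A hedged alternative is to count matches with Pattern and Matcher and bound the gap in whole words. The class name RegexProximityCounter and the gap pattern are assumptions made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexProximityCounter {
    // Counts non-overlapping matches of "a ... b" or "b ... a" where at most
    // (maxDistance - 1) whole words sit between the two. Requires maxDistance >= 1.
    static int countPairs(String input, String a, String b, int maxDistance) {
        // Separator, then lazily up to (maxDistance - 1) intervening words.
        String gap = "\\W+(?:\\w+\\W+){0," + (maxDistance - 1) + "}?";
        String qa = Pattern.quote(a);
        String qb = Pattern.quote(b);
        Pattern pattern = Pattern.compile(
                "\\b(?:" + qa + gap + qb + "|" + qb + gap + qa + ")\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(input);
        int count = 0;
        // find() resumes after the previous match, so no word is reused.
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

For the sentence in the question this returns 2, with or without the final period. Because each match consumes the words between the pair, it may count fewer pairs than a word-by-word loop when pairs interleave.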

Answer 4

This regex will match each occurrence of two words co-occurring within 5 words of each other:

([a-zA-Z]+)(?:[^ ]* ){0,5}\1[^a-zA-Z]

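([a-zA-Z]+) matches a word; if your words can contain digits [0-9], you can replace it with ([a-zA-Z0-9]+).
(?:[^ ]* ){0,5} matches 0 to 5 words.
\1[^a-zA-Z] matches the repetition of the word.

You can then use it with a Pattern and find every occurrence of the repeated word.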
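
A minimal sketch of the Pattern usage mentioned above, using java.util.regex.Pattern and Matcher. The class name RepeatedWordFinder and the method findRepeats are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWordFinder {
    // The regex from this answer, with the backslash escaped for a Java string literal.
    private static final Pattern REPEAT =
            Pattern.compile("([a-zA-Z]+)(?:[^ ]* ){0,5}\\1[^a-zA-Z]");

    // Returns the captured text (group 1) of every match found in the input.
    static List<String> findRepeats(String input) {
        List<String> repeats = new ArrayList<>();
        Matcher matcher = REPEAT.matcher(input);
        while (matcher.find()) {
            repeats.add(matcher.group(1));
        }
        return repeats;
    }
}
```

On the sentence from the question, this finds a single match whose captured word is "hat" (from "hat. The hat "). In other words, the pattern looks for one word repeated nearby rather than for "big" near "hat". Since the group is not anchored at a word boundary, it may also capture word fragments on other inputs, and the trailing [^a-zA-Z] requires a character after the repeated word.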

Java regex for finding two words in close proximity to each other
