RegEx that matches only if the string contains a word from each list

Date: 2018-02-01 13:29:32

Tags: java regex permutation

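The software I am developing must check whether a text contains a word taken from one specified list and also a word taken from another specified list.

Example: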

list 1: dog, cat
list 2: house, tree

The following texts must match:

the dog is in the house -> contains dog and house
my house is full of dogs -> contains dog and house
the cat is on the tree -> contains cat and tree

The following examples must not match:

the frog is in the house -> there is no word from the first list
Boby is the name of my dog -> there is no word from the second list
Outside my house there is a tree -> there is no word from the first list

As a quick fix, I have made a list of patterns:

dog.*house, house.*dog, cat.*house, ...

But I am pretty sure there is a smarter way...

2 answers:

Answer 0 (score: 0)

You can use an alternation (|) for each set of alternatives, and an outer alternation to cover both orders. So:

(?:(?:dog|cat).*(?:house|tree))|(?:(?:house|tree).*(?:dog|cat))

JavaScript example (non-capturing groups and alternation work the same way in Java and JavaScript):

var tests = [
    {match: true,  text: "the dog is in the house -> contains dog and house"},
    {match: true,  text: "my house is full of dogs -> contains dog and house"},
    {match: true,  text: "the cat is on the tree -> contains cat and tree"},
    {match: false, text: "the frog is in the house -> there is no word from the first list"},
    {match: false, text: "Boby is the name of my dog -> there is no word from the second list"},
    {match: false, text: "Outside my house there is a tree -> there is no word from the first list"}
];
var rex = /(?:(?:dog|cat).*(?:house|tree))|(?:(?:house|tree).*(?:dog|cat))/;
tests.forEach(function(test) {
  var result = rex.test(test.text);
  if (!!result == !!test.match) {
    console.log('GOOD: "' + test.text + '": ' + result);
  } else {
    console.log('BAD: "' + test.text + '": ' + result + ' (expected ' + test.match + ')');
  }
});

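Note that the pattern above does not check for words, only for sequences of letters. If you want actual words, you need to add word-boundary assertions or something similar. That is left as an exercise for the reader...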
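One possible way to approach the word-boundary variant, sketched in Java because the question is tagged java. The class and method names are made up for illustration and are not part of the answer; `\b` is the word-boundary assertion of `java.util.regex`. Two variants are shown, because putting `\b` on both sides of a word rejects plurals such as "dogs", which the question expects to match.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are illustrative only.
class WordBoundaryExample {

    // Whole words only: \b on both sides of each group, in both orders.
    static final Pattern WHOLE_WORDS = Pattern.compile(
        "\\b(?:dog|cat)\\b.*\\b(?:house|tree)\\b"
        + "|\\b(?:house|tree)\\b.*\\b(?:dog|cat)\\b");

    // Word prefixes: \b only in front, so "dogs" still counts as "dog".
    static final Pattern WORD_PREFIXES = Pattern.compile(
        "\\b(?:dog|cat).*\\b(?:house|tree)"
        + "|\\b(?:house|tree).*\\b(?:dog|cat)");

    static boolean containsWholeWords(String text) {
        // find() searches for the pattern anywhere in the text
        return WHOLE_WORDS.matcher(text).find();
    }

    static boolean containsWordPrefixes(String text) {
        return WORD_PREFIXES.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "the dog is in the house",
            "my house is full of dogs",
            "the hotdog stand is by the house"
        };
        for (String s : samples) {
            System.out.println(s + " -> whole words: " + containsWholeWords(s)
                + ", prefixes: " + containsWordPrefixes(s));
        }
    }
}
```

With these patterns, "my house is full of dogs" is accepted only by the prefix variant, and "the hotdog stand is by the house" is rejected by both, whereas the pattern without boundaries would accept it.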

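Answer 1 (score: 0)

Here is a solution that works for any number of lists, each containing any number of words.

Regular expressions are made for scanning linear sequences. You, however, are asking two yes/no questions whose answers do not depend on the order in which the patterns occur. You therefore have to enumerate all permutations of the sub-regexes. For a small number of lists this can be done by hand, as shown in the other answer. Below is a solution for the general case.

You certainly don't want to write that regex by hand, so here is a Java program that does what you want: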

import java.util.*;
import java.util.stream.*;
import static java.util.Arrays.asList;
import static java.util.stream.Collectors.toList;

public class RegexWithPermutations {

    /** Build a regex that checks whether
      * a string contains one of the words.
      */
    public static String containsWordRegex(List<String> words) {
      StringBuilder sb = new StringBuilder();
      boolean first = true;
      for (String w: words) {
        if (!first) {
          sb.append("|");
        }
        sb.append("(?:" + w + ")");
        first = false;
      }
      return sb.toString();
    }

    /** Generates all permutations of regexes.
      */
    public static String allRegexPermutations(
      final List<String> regexes,
      final String separator
    ) {
      class PermutationHelper {
        /** Deletes one element from the array */
        private int[] remove(int[] arr, int idx) {
          int n = arr.length;
          int[] res = new int[n - 1];
          System.arraycopy(arr, 0, res, 0, idx);
          System.arraycopy(arr, idx + 1, res, idx, n - idx - 1);
          return res;
        }

        /** Helper method that generates all permutations combined with "|".
          */
        public List<String> rec(String suffix, int[] unusedIndices) {
          if (unusedIndices.length == 1) {
            return asList(regexes.get(unusedIndices[0]) + suffix);
          } else {
            return IntStream.range(0, unusedIndices.length)
              .boxed()
              .<String>flatMap(i -> rec(
                // prepend this regex to the suffix built so far (needed for 3+ lists)
                separator + regexes.get(unusedIndices[i]) + suffix,
                remove(unusedIndices, i)
              ).stream())
              .collect(toList());
          }
        }
      }
      int[] startIndices = new int[regexes.size()];
      for (int i = 0; i < regexes.size(); i++) {
        startIndices[i] = i;
      }
      List<String> ps = (new PermutationHelper()).rec("", startIndices);
      StringBuilder b = new StringBuilder();
      boolean first = true;
      for (String p : ps) {
        if (!first) {
          b.append("|");
        }
        b.append(p);
        first = false;
      }
      return b.toString();
    }

    public static void main(String[] args) {
      List<String> list_1 = asList("dog", "cat");
      List<String> list_2 = asList("house", "tree");  

      List<String> examples = asList(
        "the dog is in the house",
        "my house is full of dogs",
        "the cat is on the tree",
        "the frog is in the house",
        "Boby is the name of my dog",
        "Outside my house there is a tree"
      );

      String regex = ".*(?:" + allRegexPermutations(asList(
        "(?:" + containsWordRegex(list_1) + ")",
        "(?:" + containsWordRegex(list_2) + ")"
      ), ".*") + ").*";

      System.out.println("Constructed regex: " + regex);

      for (String example: examples) {
        System.out.println(example + " -> " + example.matches(regex));
      }
    }
}

Output:

    Constructed regex: .*(?:(?:(?:house)|(?:tree)).*(?:(?:dog)|(?:cat))|(?:(?:dog)|(?:cat)).*(?:(?:house)|(?:tree))).*
    the dog is in the house -> true
    my house is full of dogs -> true
    the cat is on the tree -> true
    the frog is in the house -> false
    Boby is the name of my dog -> false
    Outside my house there is a tree -> false

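It works for any number of lists (but the length of the regex grows faster than exponentially, since n lists yield n! permutations, so it is not advisable to use it for more than 3, 4, or 5 lists).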
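If that growth is a concern, one alternative that is not taken from either answer is to use one lookahead per list instead of enumerating permutations, so the pattern grows linearly with the number of lists. The following is a minimal sketch under that assumption, for Java 8 or later; the class and method names are hypothetical, and `Pattern.quote` is used so that each word is treated literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical example class; not part of the answer above.
class LookaheadExample {

    /** Builds (?=.*(?:w1|w2))(?=.*(?:w3|w4))... followed by .* */
    static Pattern oneWordFromEachList(List<List<String>> lists) {
        String lookaheads = lists.stream()
            .map(words -> words.stream()
                .map(Pattern::quote)  // treat each word as a literal
                .collect(Collectors.joining("|", "(?=.*(?:", "))")))
            .collect(Collectors.joining());
        // DOTALL lets '.' cross line breaks in multi-line input
        return Pattern.compile(lookaheads + ".*", Pattern.DOTALL);
    }

    static boolean containsOneFromEach(String text, List<List<String>> lists) {
        // every lookahead is evaluated at position 0, so order is irrelevant
        return oneWordFromEachList(lists).matcher(text).matches();
    }

    public static void main(String[] args) {
        List<List<String>> lists = Arrays.asList(
            Arrays.asList("dog", "cat"),
            Arrays.asList("house", "tree"));
        for (String s : Arrays.asList(
                "the dog is in the house",
                "the frog is in the house",
                "Outside my house there is a tree")) {
            System.out.println(s + " -> " + containsOneFromEach(s, lists));
        }
    }
}
```

Like the generated regex above, this sketch matches letter sequences rather than whole words, so "dogs" still counts as containing "dog".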