Question

I want to find names from a large list of about 1 million names in a set of text documents. I first build a pattern from the names in the list:

    BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));

    String dataRow = TSVFile.readLine();
    dataRow = TSVFile.readLine();// skip first line (header)

    String combined = "";
    while (dataRow != null) {
        String[] dataArray = dataRow.split("\t");
        String name = dataArray[1];
        combined += name.replace("\"", "") + "|";

        dataRow = TSVFile.readLine(); // Read next line of data.
    }
    TSVFile.close();
    Pattern all = Pattern.compile(combined);

After doing this, I get a PatternSyntaxException, because some names contain a '+' or other regex metacharacters. I tried to work around it by ignoring those few names:

    if (name.contains("\"")) {
        // ignore this name
    }

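That didn't work properly, and it is also messy, because you have to escape everything by hand, rerun many times, and waste a lot of time. Then I tried the quote method: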

    Pattern all = Pattern.compile(Pattern.quote(combined));

But now I don't find any matches in the text documents, even though I also use quote on them. How can I solve this?

Answer 1

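I agree with @dragon66's comment that you should not quote the pipe "|". So, applying Pattern.quote() to each name rather than to the combined string, your code would look like this: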

    BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));

    String dataRow = TSVFile.readLine();
    dataRow = TSVFile.readLine();// skip first line (header)

    String combined = "";
    while (dataRow != null) {
        String[] dataArray = dataRow.split("\t");
        String name = dataArray[1];
        combined += Pattern.quote(name.replace("\"", "")) + "|"; //line changed

        dataRow = TSVFile.readLine(); // Read next line of data.
    }
    TSVFile.close();
    Pattern all = Pattern.compile(combined);
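To see why quoting the whole combined string finds nothing, here is a minimal sketch. The class name, the two sample names, and the sample text are made up for illustration; only Pattern.quote and Pattern.compile come from java.util.regex.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical sample text containing both names.
    static final String TEXT = "Contact C++ Group or A+B Ltd today";

    // Quoting the joined string turns the "|" into a literal character,
    // so the pattern only matches the exact text "C++ Group|A+B Ltd".
    static boolean wholeListQuoted() {
        Pattern p = Pattern.compile(Pattern.quote("C++ Group|A+B Ltd"));
        return p.matcher(TEXT).find();
    }

    // Quoting each name separately keeps "|" as the alternation operator.
    static boolean eachNameQuoted() {
        Pattern p = Pattern.compile(
                Pattern.quote("C++ Group") + "|" + Pattern.quote("A+B Ltd"));
        return p.matcher(TEXT).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeListQuoted()); // false
        System.out.println(eachNameQuoted());  // true
    }
}
```

The first pattern is one long literal, which is why the documents never match it.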

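Also, I'd suggest checking whether your problem domain needs optimization: replace String combined = ""; with the mutable StringBuilder class to avoid creating unnecessary new strings inside the loop.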
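A rough sketch of that StringBuilder variant follows. The class and method names are my own, and the in-memory TSV in main stands in for names.tsv. It also adds the "|" only between names, since a trailing "|" leaves an empty alternative that matches the empty string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamePatternBuilder {
    // Builds one alternation from the second column of tab-separated input.
    static Pattern buildPattern(BufferedReader reader) {
        StringBuilder alternation = new StringBuilder();
        reader.lines().skip(1).forEach(row -> {      // skip(1) drops the header
            String[] columns = row.split("\t");
            if (columns.length < 2) {
                return;                              // ignore malformed rows
            }
            String name = columns[1].replace("\"", "");
            if (name.isEmpty()) {
                return;                              // an empty name would match everywhere
            }
            if (alternation.length() > 0) {
                alternation.append('|');             // separator only between names
            }
            alternation.append(Pattern.quote(name));
        });
        return Pattern.compile(alternation.toString());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical data standing in for names.tsv.
        String tsv = "id\tname\n1\t\"C++ Group\"\n2\tA+B Ltd\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(tsv))) {
            Matcher m = buildPattern(reader).matcher("Contact C++ Group or A+B Ltd today");
            while (m.find()) {
                System.out.println(m.group());
            }
        }
    }
}
```

Note that if the file contains no usable names, the resulting pattern is empty and matches at every position, so you may want to check for that case before searching.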

Answer 2

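guilhermerama has already given the bug fix for your code.

I will add some performance improvements. As I pointed out, Java's regex library does not scale to this many alternatives, and it is even slower when used for searching.

You can do better with multi-string search algorithms. For example, using StringsAndChars String Search: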

    // setting up a test file
    // (CREATE, WRITE, TRUNCATE_EXISTING need: import static java.nio.file.StandardOpenOption.*;)
    Iterable<String> lines = createLines();
    Files.write(Paths.get("names.tsv"), lines, CREATE, WRITE, TRUNCATE_EXISTING);

    // read the pattern from the file
    BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));

    Set<String> combined = new LinkedHashSet<>();

    String dataRow = TSVFile.readLine();
    dataRow = TSVFile.readLine();// skip first line (header)

    while (dataRow != null) {
      String[] dataArray = dataRow.split("\t");
      String name = dataArray[1];
      combined.add(name);

      dataRow = TSVFile.readLine(); // Read next line of data.
    }

    TSVFile.close();

    // search the pattern in a small text
    StringSearchAlgorithm stringSearch = new AhoCorasick(new ArrayList<>(combined));
    StringFinder finder = stringSearch.createFinder(new StringCharProvider("test " + name(38) + "\n or " + name(799) + " : " + name(99999), 0));
    System.out.println(finder.findAll());

The result will be:

    [5:10(00038), 15:20(00799), 23:28(99999)]

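The search (finder.findAll()) takes < 1 millisecond on my computer. Doing the same with java.util.regex takes about 20 milliseconds.

You can tune this performance further by using the other algorithms provided by RexLex.

The setup requires the following code: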

    private static Iterable<String> createLines() {
        List<String> list = new ArrayList<>();
        for (int i = 0; i < 100000; i++)  {
            list.add(i + "\t" + name(i));
        }
        return list;
    }

    private static String name(int i) {
        String s = String.valueOf(i);
        while (s.length() < 5)  {
            s = '0' + s;
        }
        return s;
    }
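If you want to see what a multi-string search does without pulling in a library, here is a minimal, unoptimized sketch of the Aho-Corasick idea in plain Java. It is not the library's implementation; the class name and the start:end(word) output format are my own choices to mirror the result shown above. It builds a trie of all names, adds failure links with a breadth-first pass, and then scans the text once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // words ending at this node
        Node fail;                                      // longest proper suffix present in the trie
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> words) {
        // 1. Insert every word into the trie.
        for (String word : words) {
            if (word.isEmpty()) continue;
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(word);
        }
        // 2. Breadth-first pass: compute failure links and inherit outputs.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs); // shorter words ending here too
                queue.add(child);
            }
        }
    }

    // Scans the text once and reports every hit as "start:end(word)", end exclusive.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            Node target = node.next.get(c);
            node = (target != null) ? target : root;
            for (String word : node.outputs) {
                hits.add((i - word.length() + 1) + ":" + (i + 1) + "(" + word + ")");
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AhoCorasickSketch search =
                new AhoCorasickSketch(Arrays.asList("00038", "00799", "99999"));
        System.out.println(search.findAll("test 00038\n or 00799 : 99999"));
    }
}
```

With those three names, main prints [5:10(00038), 15:20(00799), 23:28(99999)]. The scan visits each character of the text once regardless of how many names are loaded, which is where the advantage over a huge regex alternation comes from. For a million names, a real library with compact node storage is likely the better choice.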

Handling PatternSyntaxException and scanning text