Question

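I have a regex pattern of words, such as welcome1|welcome2|changeme ..., that I need to look for in thousands of files (the count varies between 100 and 8000), each ranging from 1 KB to 24 MB in size.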

I would like to know whether there is a faster way of pattern matching than what I have tried.

Environment:

jdk 1.8
Windows 10
Unix4j Library

Here is what I have tried so far:

try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                                    .filter(FilePredicates.isFileAndNotDirectory())) {

        List<String> obviousStringsList = Strings_PASSWORDS.stream()
                                                .map(s -> ".*" + s + ".*").collect(Collectors.toList()); //because Unix4j apparently needs this

        Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));

        GrepOptions options = new GrepOptions.Default(GrepOption.count,
                                                        GrepOption.ignoreCase,
                                                        GrepOption.lineNumber,
                                                        GrepOption.matchingFiles);
        Instant startTime = Instant.now();

        final List<Path> filesWithObviousStringss = stream
                .filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
                .collect(Collectors.toList());

        System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}

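I get Time taken = 60 seconds, which makes me think I'm doing something really wrong.

I've tried different approaches with the stream, and on average every one of them takes about a minute to process my current folder of 6660 files.

Grep on MSYS2/mingw64 takes about 15 seconds, and exec('grep...') in node.js takes about 12 seconds.

I chose Unix4j because it provides a Java-native grep and clean code.

Is there a way to produce better results in Java that I'm sadly missing?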

Answer 1

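I've never used Unix4j, but Java provides nice file APIs nowadays as well. Also, Unix4j#grep seems to return all the matches it finds (since you're calling .toStringResult().isEmpty()), while you only seem to need to know whether at least one match exists, which means you should be able to stop as soon as one match is found. Maybe the library provides another method that suits your needs better, something like #contains? Without Unix4j, Stream#anyMatch could be a good candidate. Here is a plain Java solution if you want to compare it with yours: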

private boolean lineContainsObviousStrings(String line) {
  return Strings_PASSWORDS // <-- weird naming BTW
    .stream()
    .anyMatch(line::contains);
}

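private boolean fileContainsObviousStrings(Path path) {
  // Files.lines reads lazily, so anyMatch stops reading at the first matching line
  try (Stream<String> stream = Files.lines(path)) {
    return stream.anyMatch(this::lineContainsObviousStrings);
  } catch (IOException ex) {
    // Files.lines declares IOException, which a filter() predicate cannot throw
    throw new UncheckedIOException(ex);
  }
}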

public List<Path> findFilesContainingObviousStrings() throws IOException { // Files.walk declares IOException
  Instant startTime = Instant.now();
  try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))) {
    return stream
      .filter(FilePredicates.isFileAndNotDirectory())
      .filter(this::fileContainsObviousStrings)
      .collect(Collectors.toList());
  } finally {
    Instant endTime = Instant.now();
    System.out.println("Time taken = " + Duration.between(startTime, endTime).getSeconds() + " seconds");
  }
}

Answer 2

I don't know what "Unix4j" provides that isn't already in the JDK, as the following code does everything with built-in features:

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
        Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
        long startTime = System.nanoTime();
        final List<Path> filesWithObviousStringss = stream
                .filter(path -> {
                    try(Scanner s = new Scanner(path)) {
                        return s.findWithinHorizon(pattern, 0) != null;
                    } catch(IOException ex) {
                        throw new UncheckedIOException(ex);
                    }
                })
                .collect(Collectors.toList());
        System.out.println("Time taken = "
            + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

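An important property of this solution is that it doesn't read the whole file but stops at the first match it encounters. It also doesn't deal with line boundaries, which suits the words you're looking for, as they never contain line breaks anyway.

After analyzing the findWithinHorizon operation, I think that line-by-line processing may be better for larger files, so you may try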
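One detail worth checking when moving away from Unix4j: the question's grep call used GrepOption.ignoreCase, while Pattern.compile(String.join("|", ...)) is case-sensitive and interprets every word as a regex. The following is a minimal sketch of building the alternation with literal quoting and case-insensitive matching. The class and method names (WordPattern, forWords) are made up for illustration, and whether you want either behavior depends on your word list.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WordPattern {
    /** Builds one alternation that matches any of the given words literally, ignoring ASCII case. */
    static Pattern forWords(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote) // treat metacharacters such as '$' or '.' literally
                .collect(Collectors.joining("|"));
        // CASE_INSENSITIVE alone covers US-ASCII letters only; add UNICODE_CASE if you need more
        return Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);
    }
}
```

The resulting Pattern can be passed to findWithinHorizon or used with asPredicate() exactly like the pattern in the surrounding snippets.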

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
        Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
        long startTime = System.nanoTime();
        final List<Path> filesWithObviousStringss = stream
                .filter(path -> {
                    try(Stream<String> s = Files.lines(path)) {
                        return s.anyMatch(pattern.asPredicate());
                    } catch(IOException ex) {
                        throw new UncheckedIOException(ex);
                    }
                })
                .collect(Collectors.toList());
        System.out.println("Time taken = "
            + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

instead.

You may also try switching the stream to parallel mode, e.g.

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
        Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
        long startTime = System.nanoTime();
        final List<Path> filesWithObviousStringss = stream
                .parallel()
                .filter(path -> {
                    try(Stream<String> s = Files.lines(path)) {
                        return s.anyMatch(pattern.asPredicate());
                    } catch(IOException ex) {
                        throw new UncheckedIOException(ex);
                    }
                })
                .collect(Collectors.toList());
        System.out.println("Time taken = "
            + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

It's hard to predict whether this has a benefit, as in most cases I/O dominates such an operation.

Answer 3

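The main reason why native tools can process such text files much faster is that they assume one particular charset, especially when it is an ASCII-based 8-bit encoding, whereas Java performs a byte-to-character conversion whose abstraction is capable of supporting arbitrary charsets.

When we similarly assume a single charset with the properties named above, we can use low-level tools, which may increase performance dramatically.

For such an operation, we define the following helper methods: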

private static char[] getTable(Charset cs) {
    if(cs.newEncoder().maxBytesPerChar() != 1f)
        throw new UnsupportedOperationException("Not an 8 bit charset");
    byte[] raw = new byte[256];
    IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
    char[] table = new char[256];
    cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
      .decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
    for(int i = 0; i < 128; i++)
        if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
    return table;
}

and

private static CharSequence mapAsciiBasedText(Path p, char[] table) throws IOException {
    try(FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
        long actualSize = fch.size();
        int size = (int)actualSize;
        if(size != actualSize) throw new UnsupportedOperationException("file too large");
        MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, 0, actualSize);
        final class MappedCharSequence implements CharSequence {
            final int start, size;
            MappedCharSequence(int start, int size) {
                this.start = start;
                this.size = size;
            }
            public int length() {
                return size;
            }
            public char charAt(int index) {
                if(index < 0 || index >= size) throw new IndexOutOfBoundsException();
                byte b = mbb.get(start + index);
                return b<0? table[b+256]: (char)b;
            }
            public CharSequence subSequence(int start, int end) {
                // end is relative to this sequence, so it must not exceed its size
                if(start<0 || end < start || end > size)
                    throw new IndexOutOfBoundsException();
                return new MappedCharSequence(start + this.start, end - start);
            }
            public String toString() {
                return new StringBuilder(size).append(this).toString();
            }
        }
        return new MappedCharSequence(0, size);
    }
}

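This allows mapping a file into virtual memory and projecting it directly onto a CharSequence without any copy operations, assuming that the mapping can be done with a simple table. For ASCII-based charsets, the majority of characters don't even need a table lookup, as their numerical value is identical to the Unicode code point.

With getTable and mapAsciiBasedText, you can implement the operation as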
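To make the table idea concrete, here is a small illustrative sketch (not part of the answer's code) that decodes single bytes with ordinary JDK calls. ByteToCharDemo and decodeByte are hypothetical names, and windows-1252 is used only as an example of an ASCII-based 8-bit charset; it is not guaranteed to be available on every JVM, hence the isSupported check.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteToCharDemo {
    /** Decodes one byte with a single-byte charset, i.e. computes one entry of a 256-entry table. */
    static char decodeByte(int unsignedByte, Charset cs) {
        return new String(new byte[] { (byte) unsignedByte }, cs).charAt(0);
    }

    public static void main(String[] args) {
        // ASCII range: the byte value already is the Unicode code point, no lookup needed
        System.out.println(decodeByte(0x41, StandardCharsets.ISO_8859_1) == 'A');
        // Upper half: the result depends on the charset, so a table lookup is required
        if (Charset.isSupported("windows-1252")) {
            char c = decodeByte(0x80, Charset.forName("windows-1252"));
            System.out.println(c == '\u20ac'); // byte 0x80 is the euro sign in windows-1252
        }
    }
}
```

This is why charAt in the mapped sequence only consults the table for negative byte values, which are the ones at or above 0x80.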

// You need this only once per JVM.
// Note that running inside IDEs like Netbeans may change the default encoding
char[] table = getTable(Charset.defaultCharset());

try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
                               .filter(Files::isRegularFile)) {
    Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
    long startTime = System.nanoTime();
    final List<Path> filesWithObviousStringss = stream//.parallel()
            .filter(path -> {
                try {
                    return pattern.matcher(mapAsciiBasedText(path, table)).find();
                } catch(IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            })
            .collect(Collectors.toList());
    System.out.println("Time taken = "
        + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}

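This runs much faster than the normal text conversion, but still supports parallel execution.

Besides requiring an ASCII-based single-byte encoding, there is the restriction that this code doesn't support files larger than 2 GiB. While it is possible to extend the solution to support larger files, I wouldn't add this complication unless it is really needed.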
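For readers who do need larger files, one possible direction is to scan the file in overlapping windows. The sketch below is an assumption-laden illustration, not the answer's method: ChunkedSearch, containsMatch, and demo are hypothetical names, ISO-8859-1 stands in for "some ASCII-based single-byte charset", and each window is decoded into a CharBuffer, which copies the data rather than projecting it zero-copy. It is only correct when the overlap is at least the longest possible match length minus one, so it fits fixed word lists but not patterns with unbounded quantifiers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Pattern;

final class ChunkedSearch {
    /** Returns true if the pattern matches within any window of the file. */
    static boolean containsMatch(Path file, Pattern pattern, int windowSize, int overlap) {
        if (overlap < 0 || overlap >= windowSize)
            throw new IllegalArgumentException("overlap must be in [0, windowSize)");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long fileSize = ch.size();
            long step = windowSize - overlap; // consecutive windows share 'overlap' bytes
            for (long pos = 0; pos < fileSize; pos += step) {
                int len = (int) Math.min(windowSize, fileSize - pos);
                MappedByteBuffer window = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                // decode() copies the window; ISO-8859-1 maps every byte to one char
                CharSequence text = StandardCharsets.ISO_8859_1.decode(window);
                if (pattern.matcher(text).find()) return true;
                if (pos + len >= fileSize) break; // this window reached the end of the file
            }
            return false;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Demo on a small temporary file; tiny windows stand in for gigabyte-sized ones. */
    static boolean demo(int overlap) {
        try {
            Path tmp = Files.createTempFile("chunked", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, "xxxxwelcome1yyyy".getBytes(StandardCharsets.ISO_8859_1));
            return containsMatch(tmp, Pattern.compile("welcome1"), 8, overlap);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

In the demo, the 8-character word straddles the boundary between the two 8-byte windows, so an overlap of 0 misses it while an overlap of 7 (word length minus one) finds it. In real use the window would be close to Integer.MAX_VALUE bytes and the overlap a few dozen bytes.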

Answer 4

Please try this out (if possible); I'm curious how it performs on your files.


Pattern matching in thousands of files
