Question

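I am trying to read all files in all subdirectories of a directory. I have written the logic, but I am doing something slightly wrong, because it reads in each file twice.

To test my implementation, I created a directory with three subdirectories, each containing 10 documents. That should be 30 documents in total.

Here is my test code for reading in the documents: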

String basePath = "src/test/resources/20NG";
Driver driver = new Driver();
List<Document> documents = driver.readInCorpus(basePath);
assertEquals(3 * 10, documents.size());

Driver#readInCorpus has the following code:

public List<Document> readInCorpus(String directory)
{
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isDirectory)
                .map(this::readAllDocumentsInDirectory)
                .flatMap(Collection::stream)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private List<Document> readAllDocumentsInDirectory(Path path)
{
    try (Stream<Path> paths = Files.walk(path))
    {
        return paths
                .filter(Files::isRegularFile)
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private Document readInDocumentFromFile(Path path)
{
    String fileName = path.getFileName().toString();
    String outputClass = path.getParent().getFileName().toString();
    List<String> words = EmailProcessor.readEmail(path);
    return new Document(fileName, outputClass, words);
}

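When I run the test case, assertEquals fails because 60 documents are retrieved instead of 30. When I debug, the documents are all inserted into the list once, and then inserted again in exactly the same order.

Where in my code am I reading in the documents twice?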

Answer 1

The problem here is in the Files.walk(path) method. You are using it incorrectly. It traverses your file system like a tree. For example, say you have 3 folders: /parent and 2 children, /parent/first and /parent/second. Files.walk("/parent") will give you a tree of Paths containing every folder (the parent and the 2 children), and this is what happens in your readInCorpus method.

Then, for each Path, you call the second method, readAllDocumentsInDirectory, and it is the same story there: it traverses the folder like a tree.

For readAllDocumentsInDirectory with the path /parent, it returns all documents from the subfolders /parent/first and /parent/second. Then you have 2 more calls to readAllDocumentsInDirectory, with /parent/first and /parent/second, which return the documents from both folders again.

That is why your documents are doubled. To fix it, you should only call the method readAllDocumentsInDirectory with the argument Paths.get(basePath) and remove the readInCorpus method.
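The double counting described above can be reproduced on a throwaway directory tree. The following is only a minimal sketch: the class name WalkDoubleCountDemo and its helper methods are hypothetical, and it collects Path objects instead of the question's Document type, but the nested walk has the same shape as readInCorpus calling readAllDocumentsInDirectory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class, not part of the question's code.
class WalkDoubleCountDemo {

    // Builds <tmp>/first and <tmp>/second, each holding `perDir` empty files.
    static Path buildTree(int perDir) {
        try {
            Path parent = Files.createTempDirectory("walk-demo");
            for (String name : new String[] {"first", "second"}) {
                Path dir = Files.createDirectory(parent.resolve(name));
                for (int i = 0; i < perDir; i++) {
                    Files.createFile(dir.resolve("doc" + i + ".txt"));
                }
            }
            return parent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A single recursive walk: every regular file below `start`, once.
    static List<Path> regularFilesUnder(Path start) {
        try (Stream<Path> paths = Files.walk(start)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same shape as the question: walk, keep directories, then walk each one again.
    static List<Path> nestedWalk(Path start) {
        try (Stream<Path> paths = Files.walk(start)) {
            return paths.filter(Files::isDirectory)
                    .map(WalkDoubleCountDemo::regularFilesUnder)
                    .flatMap(List::stream)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path parent = buildTree(3);
        System.out.println("single walk: " + regularFilesUnder(parent).size()); // 6
        System.out.println("nested walk: " + nestedWalk(parent).size());        // 12
    }
}
```

With 2 subfolders of 3 files each, the outer walk yields 3 directories (the parent and both children). The parent contributes all 6 files and each child contributes its own 3 again, giving 12.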

Answer 2

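It looks like this comes from a misunderstanding of how Paths and Files.walk work. In Driver#readAllDocumentsInDirectory, you have the following stream operations: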

return paths
        .filter(Files::isRegularFile)
        .map(this::readInDocumentFromFile)
        .collect(Collectors.toList());

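Driver#readInCorpus maps this pipeline over every directory in its own Files.walk stream, which includes the top-level directory as well as the subdirectories. Because Files.walk is recursive, each call reads all documents below the path it is given.

This means that all files under the starting directory are read once, and then read again when the subdirectories are traversed.

It is not entirely obvious from looking at the stream, but you should take a look at "How to debug stream().map(...) with lambda expressions?" to see how to debug streams more effectively and avoid this problem in the future.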
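One common way to see what flows through a stream is Stream.peek. The sketch below is only an illustration under stated assumptions: the class PeekTraceDemo is hypothetical, and plain strings ending in "/" stand in for directories so that the example does not depend on the file system. In the real code, the same peek calls could be placed before and after filter(Files::isDirectory).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; strings ending in "/" stand in for directory paths.
class PeekTraceDemo {

    // Returns the elements that pass the filter and records what each stage saw.
    static List<String> traced(List<String> input, List<String> trace) {
        return input.stream()
                .peek(s -> trace.add("seen: " + s))   // everything the walk produced
                .filter(s -> s.endsWith("/"))         // stand-in for Files::isDirectory
                .peek(s -> trace.add("kept: " + s))   // what reaches the map step
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        traced(Arrays.asList("parent/", "parent/first/", "parent/first/a.txt"), trace);
        trace.forEach(System.out::println);
    }
}
```

The trace shows that "parent/" itself is kept, which is the hint that the top-level directory is also handed to the per-directory reader.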

This means that you can skip the intermediate step of calling Driver#readAllDocumentsInDirectory and simply do this in Driver#readInCorpus:

public List<Document> readInCorpus(String directory)
{
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isRegularFile)
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}
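If the two-step structure (visit each directory, then read its files) is worth keeping, an alternative not proposed in either answer is to make the inner step non-recursive with Files.list, which only returns the direct children of a directory. This is a sketch under that assumption; the class DirectChildrenDemo and its methods are hypothetical, and it returns Path objects rather than Document instances.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class, not part of the question's code.
class DirectChildrenDemo {

    // Non-recursive: only regular files that sit directly inside `dir`.
    static List<Path> directFiles(Path dir) {
        try (Stream<Path> children = Files.list(dir)) {
            return children.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Outer walk is recursive, inner listing is not, so each file appears once.
    static List<Path> allFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isDirectory)
                    .map(DirectChildrenDemo::directFiles)
                    .flatMap(List::stream)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a temporary tree with 2 subdirectories of 2 files each.
    static Path sampleTree() {
        try {
            Path root = Files.createTempDirectory("list-demo");
            for (String name : new String[] {"first", "second"}) {
                Path dir = Files.createDirectory(root.resolve(name));
                Files.createFile(dir.resolve("a.txt"));
                Files.createFile(dir.resolve("b.txt"));
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(allFiles(sampleTree()).size()); // 4
    }
}
```

The single-walk version shown in the answer is shorter. This variant only makes sense when some per-directory processing is needed between the two steps.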

How to read all files in subdirectories once using Files.walk?
