Question

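I need to be able to crawl an online directory such as http://svn.apache.org/repos/asf/, and whenever the crawl encounters a pdf, docx, txt, or odt file, I need to parse it and extract its text.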

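I am using Files.walk to crawl locally on my laptop and the Apache Tika library to parse the text. That works fine, but I really don't know how to do the same for an online directory.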

Here is the code that walks through my PC and parses the files, so you can see what I'm doing:

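import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import org.apache.tika.exception.TikaException;
import org.xml.sax.SAXException;

public static void GetFiles() throws IOException {
    // PathXml is the root directory (e.g. "/home/user/"), read from an XML file.
    // Files.walk keeps directory handles open, so the stream is closed when done.
    try (Stream<Path> paths = Files.walk(Paths.get(PathXml))) { // crawling process (Java 8)
        paths.filter(Files::isRegularFile).forEach(filePath -> {
            String name = filePath.toString();
            // .odt is included here because the question lists it as a target format
            if (name.endsWith(".pdf") || name.endsWith(".docx")
                    || name.endsWith(".txt") || name.endsWith(".odt")) {
                try {
                    TikaReader.ParsedText(name);
                } catch (IOException | SAXException | TikaException e) {
                    e.printStackTrace();
                }
                System.out.println(filePath);
            }
        });
    }
}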

This is the TikaReader method:

public static String ParsedText(String file) throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, handler, metadata);
        String text = handler.toString();
        System.out.println(text);
        return text;
    }
}

Once again, how can I do the same for the online directory given above?
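To make the goal concrete, here is a rough sketch of what an online equivalent of Files.walk might look like. It rests on several assumptions: that the server returns a plain HTML index page for each directory, that links in it are written as double-quoted href attributes, and that subdirectory links end in "/". The class name OnlineDirectoryCrawler and its methods are hypothetical names made up for illustration, and it uses only the JDK rather than an HTML parsing library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Tika or the JDK.
class OnlineDirectoryCrawler {
    // Assumption: the index page writes links as href="..." with double quotes.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Resolves every href found in the listing HTML against the listing's own URL. */
    public static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1)).normalize());
            } catch (IllegalArgumentException e) {
                // skip hrefs that are not valid URIs
            }
        }
        return links;
    }

    /** Same extension filter as the local crawler, applied to the URL path. */
    public static boolean isDocument(URI uri) {
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        return path.endsWith(".pdf") || path.endsWith(".docx")
                || path.endsWith(".txt") || path.endsWith(".odt");
    }

    /** A link counts as a subdirectory if it ends in "/" and lies strictly below base. */
    public static boolean isSubdirectory(URI link, URI base) {
        String l = link.toString();
        String b = base.toString();
        return l.endsWith("/") && l.startsWith(b) && l.length() > b.length();
    }

    /** Recursively visits listings under dir and reports each matching document URL. */
    public static void crawl(URI dir, Set<URI> visited, Consumer<URI> onDocument)
            throws IOException {
        if (!visited.add(dir)) {
            return; // already crawled this listing
        }
        String html;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(dir.toURL().openStream(), StandardCharsets.UTF_8))) {
            html = reader.lines().collect(Collectors.joining("\n"));
        }
        for (URI link : extractLinks(html, dir)) {
            if (isSubdirectory(link, dir)) {
                crawl(link, visited, onDocument);
            } else if (isDocument(link)) {
                onDocument.accept(link);
            }
        }
    }
}
```

Each URL reported to onDocument could presumably be opened with link.toURL().openStream() and the resulting InputStream passed to parser.parse(stream, handler, metadata) in place of the FileInputStream used in ParsedText. A regular expression is a fragile way to read HTML, so a proper HTML parser would likely be the safer choice for anything beyond simple index pages.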

Crawling an online directory and parsing online PDF documents to extract text in Java

0 answers: