Question

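I need to parse the contents of an epub file, and I am trying to find the most efficient way to do it. An epub file may contain images, a lot of text, and occasionally video. Should I use FileInputStream or FileReader?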

Answer 1

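Since an epub uses the ZIP archive structure, I would suggest handling it as a ZIP archive. Below is a small snippet that lists the contents of an epub file.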

import java.io.IOException;
import java.net.URI;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// No "create" entry is needed: the archive is only read and must already exist.
Map<String, String> env = new HashMap<>();

Path path = Paths.get("foobar.epub");
URI uri = URI.create("jar:" + path.toUri());
// try-with-resources closes the zip file system once the walk is finished
try (FileSystem zipFs = FileSystems.newFileSystem(uri, env)) {
    Path root = zipFs.getPath("/");
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file,
                BasicFileAttributes attrs) throws IOException {
            print(file);
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult preVisitDirectory(Path dir,
                BasicFileAttributes attrs) throws IOException {
            print(dir);
            return FileVisitResult.CONTINUE;
        }

        // prints last-modified time, size in bytes, and path of an entry
        private void print(Path file) throws IOException {
            Date lastModifiedTime = new Date(Files.getLastModifiedTime(file).toMillis());
            System.out.printf("%td.%<tm.%<tY %<tH:%<tM:%<tS %9d %s\n",
                    lastModifiedTime, Files.size(file), file);
        }
    });
}

Sample output

01.01.1970 00:59:59         0 /META-INF/
11.02.2015 16:33:44       244 /META-INF/container.xml
11.02.2015 16:33:44      3437 /logo.jpg
...

Edit: If you only want to extract files based on their name, you can do it as shown in the following snippet of the visitFile(...) method.

public FileVisitResult visitFile(Path file,
        BasicFileAttributes attrs) throws IOException {
    // Path.endsWith compares whole name elements, so this matches an
    // entry whose file name is exactly "logo.jpg" (in any directory)
    if (file.endsWith("logo.jpg")) {
        // extract the file into the directory /tmp/
        Files.copy(file, Paths.get("/tmp/",
            file.getFileName().toString()));
    }
    return FileVisitResult.CONTINUE;
}
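
If a single known entry is all that is needed, walking the whole tree can be skipped. The following is a minimal sketch, not part of the original answer: the class name EpubEntryReader and the method readTextEntry are hypothetical, and it assumes the entry is UTF-8 encoded text (as container.xml usually is).

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical helper, for illustration only.
class EpubEntryReader {

    /** Returns the content of one entry (e.g. "/META-INF/container.xml"), decoded as UTF-8. */
    static String readTextEntry(Path epub, String entryName) throws IOException {
        URI uri = URI.create("jar:" + epub.toUri());
        // The zip file system is closed again as soon as the entry has been read.
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, new HashMap<String, String>())) {
            Path entry = zipFs.getPath(entryName);
            // Assumption: the entry is small enough to be held in memory.
            return new String(Files.readAllBytes(entry), StandardCharsets.UTF_8);
        }
    }
}
```

A call such as EpubEntryReader.readTextEntry(Paths.get("foobar.epub"), "/META-INF/container.xml") would return the XML as a String. For large binary entries such as video, copying to a file as shown above is probably the better choice.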

Depending on how you want to process the files inside the epub, you might also want to have a look at ZipInputStream.

try (ZipInputStream in = new ZipInputStream(new FileInputStream("foobar.epub"))) {
    for (ZipEntry entry = in.getNextEntry(); entry != null; 
        entry = in.getNextEntry()) {
        System.out.printf("%td.%<tm.%<tY %<tH:%<tM:%<tS %9d %s\n",
                new Date(entry.getTime()), entry.getSize(), entry.getName());
        if (entry.getName().endsWith("logo.jpg")) {
            try (FileOutputStream out = new FileOutputStream(entry.getName())) {
                // process the file
            }
        }
    }
}

Sample output

11.02.2013 16:33:44       244 META-INF/container.xml
11.02.2013 16:33:44      3437 logo.jpg
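
The "process the file" placeholder above could, for example, copy the bytes of the current entry out of the stream. A minimal sketch under stated assumptions: ZipStreamExtractor and readEntry are hypothetical names, and the entry is assumed to fit in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, for illustration only.
class ZipStreamExtractor {

    /** Returns the bytes of the first entry whose name ends with suffix, or null if there is none. */
    static byte[] readEntry(InputStream epubStream, String suffix) throws IOException {
        try (ZipInputStream in = new ZipInputStream(epubStream)) {
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                if (!entry.getName().endsWith(suffix)) {
                    continue;
                }
                // read() returns -1 at the end of the current entry, not of the whole archive
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            }
        }
        return null;
    }
}
```

Because ZipInputStream reads the archive sequentially, this works on any InputStream (a file, a network stream), but it has to scan past every entry before the one that is wanted.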

Answer 2

The simplest way to read a whole file as bytes (which is what you want if it is not plain text) is to use the java.nio.file.Files class:

byte[] content = Files.readAllBytes(Paths.get("example.epub"));

Advantages of this approach:

Less code, so the code is easier to read and less error-prone
Java takes care of opening and closing the file
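
To illustrate why bytes are the right choice for non-text data, the sketch below reads the same file once as bytes and once through a character decoder, which is what FileReader does internally. The class and method names are hypothetical, and UTF-8 is an assumed charset picked to make the behaviour deterministic.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, for illustration only.
class BytesVsChars {

    /** Reads the file unchanged, byte for byte. */
    static byte[] readAsBytes(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    /** Decodes the file as UTF-8 text, then encodes the text back to bytes. */
    static byte[] readAsCharsThenEncode(Path file) throws IOException {
        StringBuilder text = new StringBuilder();
        // InputStreamReader (the superclass of FileReader) replaces byte
        // sequences that are not valid in the charset with U+FFFD.
        try (Reader reader = new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

For a file that starts with typical JPEG bytes such as 0xFF 0xD8, the second method no longer returns the original content, because the invalid sequences were replaced during decoding. That is the reason a Reader is unsuitable for archives, images, or video.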

Edit

To read the file quickly, you can also use java.nio, this time java.nio.channels.FileChannel:

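import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Load the file
FileChannel c = new FileInputStream("example.epub").getChannel();
MappedByteBuffer byteBuffer = c.map(FileChannel.MapMode.READ_ONLY, 0, c.size());

// Process the data: copy 50 bytes from the buffer's current position
// into myByte, starting at index 1120 of that array
byte[] myByte = new byte[1170]; // sized so that offset 1120 + length 50 fits
byteBuffer.get(myByte, 1120, 50);

// when finished
c.close();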

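This does not read the whole file into memory. Instead it maps the file and only reads (buffers) the parts you actually access. Changes made to the file may also become visible through the buffer, although whether and when that happens is operating-system dependent.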
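
As one possible use of such a mapping, the sketch below maps only the first four bytes of a file and checks them against the ZIP local file header signature ("PK" followed by the bytes 3 and 4). The class name ZipSignatureCheck and its method are hypothetical, and the check is only a quick plausibility test, not a validation of the epub.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
class ZipSignatureCheck {

    /** Returns true if the file begins with the bytes 'P', 'K', 0x03, 0x04. */
    static boolean startsWithZipSignature(Path file) throws IOException {
        // try-with-resources closes the channel even if mapping fails
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            if (channel.size() < 4) {
                return false; // too short to hold the signature
            }
            // Map only the region that is needed, not the whole file
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, 4);
            // Absolute get(index) does not move the buffer's position
            return buffer.get(0) == 'P' && buffer.get(1) == 'K'
                    && buffer.get(2) == 3 && buffer.get(3) == 4;
        }
    }
}
```

Opening the channel with FileChannel.open instead of going through FileInputStream is a design choice made here to keep the resource handling in a single try-with-resources statement.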

Can I use FileReader to read a file that contains images and video as well as text (such as an epub file), and is that advisable in terms of performance?
