Question

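I need to use a large file containing String,String pairs. Because I want to ship it with a JAR, I chose to include a serialized, gzip-compressed version in the application's resources folder. This is how I create the serialized file: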

ObjectOutputStream out = new ObjectOutputStream(
            new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(OUT_FILE_PATH, false))));
out.writeObject(map);
out.close();

I chose a HashMap<String,String>. The resulting file is 60MB and the map contains about 4 million entries.

Now, whenever I need the map, I deserialize it like this:

final InputStream in = FileUtils.getResource("map.ser.gz");
final ObjectInputStream ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(in)));
map = (Map<String, String>) ois.readObject();
ois.close();

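This takes about 10~15 seconds. Is there a better way to store such a large map in a JAR? I ask because I also use the Stanford CoreNLP library, which itself uses large model files but seems to perform better in this regard. I tried to find the code where the model files are read, but gave up.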

Answer 1

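Your problem is that you compressed the data. Store it as plain text.

The performance hit most likely comes from decompressing the stream. JARs are already compressed, so storing the file in compressed form saves no space.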
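One way to check where the time actually goes is to deserialize the same map with and without gzip and compare. The following is only a rough sketch: the class name GzipCostCheck and the 200,000-entry test map are assumptions made for illustration, and single-run timings like this are affected by JVM warm-up, so treat the numbers as indicative only.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCostCheck {

    // Serializes the map into memory, optionally through a gzip stream.
    public static byte[] serialize(HashMap<String, String> map, boolean gzip) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            OutputStream sink = gzip ? new GZIPOutputStream(bytes) : bytes;
            try (ObjectOutputStream out = new ObjectOutputStream(new BufferedOutputStream(sink))) {
                out.writeObject(map);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Reads the map back, mirroring the stream stack used for writing.
    @SuppressWarnings("unchecked")
    public static Map<String, String> deserialize(byte[] data, boolean gzip) {
        try {
            InputStream source = new ByteArrayInputStream(data);
            if (gzip) {
                source = new GZIPInputStream(source);
            }
            try (ObjectInputStream in = new ObjectInputStream(new BufferedInputStream(source))) {
                return (Map<String, String>) in.readObject();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } catch (ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Hypothetical test data; the real map has about 4 million entries.
        HashMap<String, String> map = new HashMap<>();
        for (int i = 0; i < 200_000; i++) {
            map.put("key" + i, "value" + i);
        }
        for (boolean gzip : new boolean[] {false, true}) {
            byte[] data = serialize(map, gzip);
            long start = System.nanoTime();
            Map<String, String> copy = deserialize(data, gzip);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println((gzip ? "gzip:  " : "plain: ") + data.length + " bytes, "
                    + millis + " ms, " + copy.size() + " entries");
        }
    }
}
```

If the two timings turn out to be close, most of the cost is in Java deserialization itself rather than in decompression.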

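Basically:

Store the file in plain text format
Stream it using Files.lines(Paths.get("myfilenane.txt"))
Consume each line with minimal code

Something like this, assuming the data is in the form key=value (like a properties file):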

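// Needs java.nio.file.Files, java.nio.file.Paths, java.util.HashMap, java.util.Map, java.util.stream.Stream
Map<String, String> map = new HashMap<>();
// Files.lines keeps the file open, so close the stream when done (it can throw IOException)
try (Stream<String> lines = Files.lines(Paths.get("myfilenane.txt"))) {
    lines.map(s -> s.split("=", 2)) // limit of 2 keeps any '=' inside the value intact
         .forEach(a -> map.put(a[0], a[1]));
}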

Disclaimer: the code may not compile or work, as it was thumbed in on my phone (but there's a reasonable chance it will work).
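Since the file is meant to ship inside the JAR, it is not on the default file system, so Paths.get will generally not reach it; reading it as a classpath stream is one alternative. Below is a minimal sketch assuming the same key=value format. The class name PlainTextMapLoader, the resource name passed in, and the expectedEntries parameter are hypothetical and not part of the answer above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class PlainTextMapLoader {

    // Parses key=value lines from any stream, for example a classpath resource.
    public static Map<String, String> load(InputStream in, int expectedEntries) {
        // Pre-size the table so it does not rehash while filling (default load factor 0.75).
        Map<String, String> map = new HashMap<>((int) (expectedEntries / 0.75f) + 1);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int sep = line.indexOf('=');
                if (sep < 0) {
                    continue; // skip lines without a separator
                }
                // Split on the first '=' only, so values may contain '='.
                map.put(line.substring(0, sep), line.substring(sep + 1));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return map;
    }

    // Looks the resource up relative to this class; use a leading '/' for the classpath root.
    public static Map<String, String> loadFromClasspath(String resourceName, int expectedEntries) {
        InputStream in = PlainTextMapLoader.class.getResourceAsStream(resourceName);
        if (in == null) {
            throw new IllegalArgumentException("Resource not found: " + resourceName);
        }
        return load(in, expectedEntries);
    }
}
```

A call could look like PlainTextMapLoader.loadFromClasspath("/map.txt", 4_000_000), where "/map.txt" is a made-up resource name. Pre-sizing the HashMap is an extra suggestion beyond the answer; with millions of entries it avoids repeated table growth.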

Answer 2

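What you could do is apply a technique from Scott Oaks's book Java Performance: The Definitive Guide, which stores the zipped content of the object in a byte array. For this we need a wrapper class, which I call MapHolder here: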

public class MapHolder implements Serializable {
    // This will contain the zipped content of my map
    private byte[] content;
    // My actual map defined as transient as I don't want to serialize its 
    // content but its zipped content
    private transient Map<String, String> map;

    public MapHolder(Map<String, String> map) {
        this.map = map;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream zip = new GZIPOutputStream(baos);
            ObjectOutputStream oos = new ObjectOutputStream(
                new BufferedOutputStream(zip))) {
            oos.writeObject(map);
        }
        this.content = baos.toByteArray();
        out.defaultWriteObject();
        // Clear the temporary field content
        this.content = null;
    }

    private void readObject(ObjectInputStream in) throws IOException,
        ClassNotFoundException {
        in.defaultReadObject();
        try (ByteArrayInputStream bais = new ByteArrayInputStream(content);
            GZIPInputStream zip = new GZIPInputStream(bais);
            ObjectInputStream ois = new ObjectInputStream(
                new BufferedInputStream(zip))) {
            this.map = (Map<String, String>) ois.readObject();
            // Clean the temporary field content
            this.content = null;
        }
    }

    public Map<String, String> getMap() {
        return this.map;
    }
}

Your code will then simply be:

final ByteArrayInputStream in = new ByteArrayInputStream(
    Files.readAllBytes(Paths.get("/tmp/map.ser"))
);
final ObjectInputStream ois = new ObjectInputStream(in);
MapHolder holder = (MapHolder) ois.readObject();
map = holder.getMap();
ois.close();

As you may have noticed, you no longer zip the content yourself; it is zipped internally when the MapHolder instance is serialized.
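The snippet above covers only the reading side. A possible writing side, together with a round trip, could look like the sketch below. The class MapHolderRoundTrip and its save and load helpers are hypothetical names, and the nested Holder is a condensed copy of MapHolder included only so the example compiles on its own.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class MapHolderRoundTrip {

    // Condensed copy of the MapHolder idea: only the zipped bytes are serialized.
    public static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        private byte[] content;
        private transient Map<String, String> map;

        public Holder(Map<String, String> map) { this.map = map; }

        public Map<String, String> getMap() { return map; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(baos))) {
                oos.writeObject(map);
            }
            content = baos.toByteArray();
            out.defaultWriteObject();
            content = null; // temporary field, no longer needed
        }

        @SuppressWarnings("unchecked")
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            try (ObjectInputStream ois = new ObjectInputStream(
                    new GZIPInputStream(new ByteArrayInputStream(content)))) {
                map = (Map<String, String>) ois.readObject();
            }
            content = null;
        }
    }

    // Writes the wrapped map; no outer gzip stream is needed.
    public static void save(Map<String, String> map, OutputStream target) {
        try (ObjectOutputStream out = new ObjectOutputStream(new BufferedOutputStream(target))) {
            out.writeObject(new Holder(map));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Reads the holder back and unwraps the map.
    public static Map<String, String> load(InputStream source) {
        try (ObjectInputStream in = new ObjectInputStream(new BufferedInputStream(source))) {
            return ((Holder) in.readObject()).getMap();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } catch (ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

To produce the file read in the snippet above, save could be given Files.newOutputStream(Paths.get("/tmp/map.ser")) as its target. The map passed in must itself be serializable, as a HashMap is.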

Answer 3

You could consider one of the many fast serialization libraries:

protobuf (https://github.com/google/protobuf)
flat buffers (https://google.github.io/flatbuffers/)
cap'n proto (https://capnproto.org)
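Those libraries bring their own schemas and dependencies, so they are not shown here. The general idea they share, a compact length-prefixed binary layout instead of Java object serialization, can be sketched with nothing but the JDK's DataOutputStream and DataInputStream. This is a stand-in for illustration, not the wire format of any of the libraries listed; the class name BinaryMapCodec is hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class BinaryMapCodec {

    // Layout: entry count as an int, then key and value per entry in modified UTF-8.
    public static void write(Map<String, String> map, OutputStream target) {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(target))) {
            out.writeInt(map.size());
            for (Map.Entry<String, String> entry : map.entrySet()) {
                // writeUTF is limited to 65535 encoded bytes per string.
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static Map<String, String> read(InputStream source) {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source))) {
            int size = in.readInt();
            // The count is known up front, so the table can be sized once.
            Map<String, String> map = new HashMap<>((int) (size / 0.75f) + 1);
            for (int i = 0; i < size; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                map.put(key, value);
            }
            return map;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Whether this is faster than the approaches above for a 4-million-entry map would have to be measured; strings longer than the writeUTF limit would need a different encoding.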

Java: Storing a large map in resources
