Question

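I am retrieving large gzipped files from Amazon S3. I would like to transform each line of these files on the fly and upload the output to another S3 bucket.

The upload API takes an InputStream as input.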

S3Object s3object = s3.fetch(bucket, key);

InputStream is = new GZIPInputStream(s3object.getObjectContent());

// . . . ?

s3.putObject(new PutObjectRequest(bucket, key, is, metadata));

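I believe the most efficient way to do this is to write my own custom InputStream that transforms the original input stream into another one. I am not very familiar with this approach and would like to learn more.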

Answer 1

The basic idea is as follows.

It is not very efficient, but it should get the job done.

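import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MyInputStream extends InputStream {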

    private final BufferedReader input;
    private final Charset encoding = StandardCharsets.UTF_8;
    private ByteArrayInputStream buffer;

    public MyInputStream(InputStream is) throws IOException {
        input = new BufferedReader(new InputStreamReader(is, this.encoding));
        nextLine();
    }

    @Override
    public int read() throws IOException {
        if (buffer == null) {
            return -1;
        }
        int ch = buffer.read();
        if (ch == -1) {
            if (!nextLine()) {
                return -1;
            }
            return read();
        }
        return ch;
    }

    private boolean nextLine() throws IOException {
        String line;
        while ((line = input.readLine()) != null) {
            line = filterLine(line);
            if (line != null) {
                line += '\n';
                buffer = new ByteArrayInputStream(line.getBytes(encoding));
                return true;
            }
        }
        return false;
    }

    @Override
    public void close() throws IOException {
        input.close();
    }

    private String filterLine(String line) {
        // Filter the line here ... return null to skip the line
        // For example:
        return line.replace("ABC", "XYZ");
    }

}

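nextLine() pre-fills the line buffer with the next (filtered) line. read(), which the upload job calls, then takes bytes from the buffer one at a time and calls nextLine() again to load the next line once the buffer is exhausted.

Use it like this: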

s3.putObject(new PutObjectRequest(bucket, key, new MyInputStream(is), metadata));

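If CPU usage turns out to be high, performance could be improved by also implementing int read(byte[] b, int off, int len), and by wrapping the stream in a BufferedInputStream in case the S3 client does not buffer internally (I don't know whether it does).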
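For the bulk read(byte[] b, int off, int len) override mentioned above, a minimal sketch might look like the following. The class name LineFilterInputStream and the idea of passing the filter in as a UnaryOperator are my own choices for illustration, not part of the code above. It assumes UTF-8 text and, like MyInputStream, terminates every emitted line with '\n'.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

// Hypothetical variant of MyInputStream with a bulk read() override.
final class LineFilterInputStream extends InputStream {

    private final BufferedReader input;
    private final UnaryOperator<String> filter; // returns null to skip a line
    private byte[] current = new byte[0];       // bytes of the current filtered line
    private int pos = 0;                        // next unread index in current
    private boolean eof = false;

    LineFilterInputStream(InputStream is, UnaryOperator<String> filter) {
        this.input = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
        this.filter = filter;
    }

    // Ensures current has unread bytes; returns false once the source is exhausted.
    private boolean fill() throws IOException {
        while (pos >= current.length) {
            if (eof) {
                return false;
            }
            String line = input.readLine();
            if (line == null) {
                eof = true;
                return false;
            }
            String out = filter.apply(line);
            if (out != null) {
                current = (out + "\n").getBytes(StandardCharsets.UTF_8);
                pos = 0;
            }
        }
        return true;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (off < 0 || len < 0 || len > b.length - off) {
            throw new IndexOutOfBoundsException();
        }
        if (len == 0) {
            return 0;
        }
        int copied = 0;
        // Copy whole chunks with System.arraycopy instead of one byte per call.
        while (copied < len && fill()) {
            int n = Math.min(len - copied, current.length - pos);
            System.arraycopy(current, pos, b, off + copied, n);
            pos += n;
            copied += n;
        }
        return copied == 0 ? -1 : copied;
    }

    @Override
    public void close() throws IOException {
        input.close();
    }
}
```

It would be used the same way as before, e.g. new LineFilterInputStream(is, line -> line.replace("ABC", "XYZ")). Whether this actually helps depends on how the S3 client reads from the stream, so it is worth measuring first.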

Answer 2

new BufferedReader(new InputStreamReader(is)).lines()

Filter an InputStream line by line.
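Since the upload API needs an InputStream rather than a Stream of strings, the lines() idea still has to be turned back into bytes. One possible sketch, assuming UTF-8 text, wraps each filtered line in a ByteArrayInputStream and chains them lazily with SequenceInputStream. The names LineStreams and filterLines are made up for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical helper: filters lines lazily and exposes the result as an InputStream.
final class LineStreams {

    private LineStreams() {
    }

    static InputStream filterLines(InputStream is, Function<String, String> filterLine) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));

        // The iterator pulls lines on demand, so the file is not loaded into memory.
        Iterator<String> lines = reader.lines()
                .map(filterLine)          // a null result means "skip this line"
                .filter(Objects::nonNull)
                .iterator();

        Enumeration<InputStream> chunks = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return lines.hasNext();
            }

            @Override
            public InputStream nextElement() {
                byte[] bytes = (lines.next() + "\n").getBytes(StandardCharsets.UTF_8);
                return new ByteArrayInputStream(bytes);
            }
        };
        return new SequenceInputStream(chunks);
    }
}
```

Usage would be along the lines of LineStreams.filterLines(is, line -> line.replace("ABC", "XYZ")). Some caveats: read errors surface as UncheckedIOException rather than IOException, closing the returned stream does not close the underlying reader, and as far as I know SequenceInputStream.close() walks through all remaining elements, so closing early still reads the rest of the source.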
