Selectively extracting entries from a zip file on S3 without downloading the entire file

Time: 2019-11-03 17:32:25

Tags: java amazon-s3 zip

I am trying to extract specific entries from a large zip file on S3 without downloading the entire file.

The Python solution here: Read ZIP files from S3 without downloading the entire file appears to work. The equivalent underlying pieces in Java seem generally less forgiving, so I've had to make various adjustments.

In the attached code you can see that I successfully fetch the central directory and write it to a temp file, which Java's ZipFile can use to iterate the zip entries in the CD.

However, I'm stuck inflating an individual entry. The current code throws a bad header exception. Do I need to give the Inflater the local file header plus the compressed content, or just the compressed content? I've tried both, so clearly I'm either not using the Inflater correctly and/or not feeding it what it expects.

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.zip.Inflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

public class S3ZipTest {
    private AmazonS3 s3;

    public S3ZipTest(String bucket, String key) throws Exception {
        s3 = getClient();
        ObjectMetadata metadata = s3.getObjectMetadata(bucket, key);
        runTest(bucket, key, metadata.getContentLength());
    }

    private void runTest(String bucket, String key, long size) throws Exception {
        // fetch the last 22 bytes (end-of-central-directory record; assuming the comment field is empty)
        long start = size - 22;
        GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(start);
        System.out.println("eocd start: " + start);

        // fetch the end of cd record
        S3Object s3Object = s3.getObject(req);
        byte[] eocd = IOUtils.toByteArray(s3Object.getObjectContent());

        // get the start offset and size of the central directory
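        // EOCD record layout, for reference (comment assumed empty): signature(4),
        // disk number(2), CD start disk(2), CD entries on this disk(2),
        // total CD entries(2), CD size(4 @ offset 12), CD start offset(4 @ offset 16),
        // comment length(2) = 22 bytes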
        int cdSize = byteArrayToLeInt(Arrays.copyOfRange(eocd, 12, 16));
        int cdStart = byteArrayToLeInt(Arrays.copyOfRange(eocd, 16, 20));

        System.out.println("cdStart: " + cdStart);
        System.out.println("cdSize: " + cdSize);

        // get the full central directory
        req = new GetObjectRequest(bucket, key).withRange(cdStart, cdStart + cdSize - 1);
        s3Object = s3.getObject(req);
        byte[] cd = IOUtils.toByteArray(s3Object.getObjectContent());

        // write the full dir + eocd:
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // write cd
        out.write(cd);

        // write eocd, resetting the cd start to 0 since that is 
        // where it will appear in our new temp file
        byte[] b = leIntToByteArray(0);
        eocd[16] = b[0];
        eocd[17] = b[1];
        eocd[18] = b[2];
        eocd[19] = b[3];
        out.write(eocd);
        out.flush();

        byte[] cdbytes = out.toByteArray();

        // here we are writing the CD + EOCD to a temp file.
        // ZipFile can read the entries from this file.
        // ZipInputStream and commons-compress will not; they seem upset that the data isn't actually there
        File tempFile = new File("temp.zip");
        FileOutputStream output = new FileOutputStream(tempFile);
        output.write(cdbytes);
        output.flush();
        output.close();

        ZipFile zipFile = new ZipFile(tempFile);
        Enumeration<? extends ZipEntry> zipEntries = zipFile.entries();
        long offset = 0;
        while (zipEntries.hasMoreElements()) {
            ZipEntry entry = zipEntries.nextElement();
            long fileSize = 0;
            long extra = entry.getExtra() == null ? 0 : entry.getExtra().length;
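            // NOTE: this assumes the local file header is exactly
            // 30 bytes + name length + extra length, with the name/extra
            // lengths taken from the central directory copy (see edit below)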
            offset += 30 + entry.getName().length() + extra;
            if (!entry.isDirectory()) {
                fileSize = entry.getCompressedSize();
                System.out.println(entry.getName() + " offset=" + offset + " size=" + fileSize);
                // not working
                // getEntryContent(bucket, key, offset, fileSize, (int)entry.getSize());
            }
            offset += fileSize;
        }
        zipFile.close();
    }

    private void getEntryContent(String bucket, String key, long offset, long compressedSize, int fullSize) throws Exception {
        //HERE is where things go bad.
        //my guess was that we need to get past the local header for an entry to the actual 
        //start of deflated content and then read all the content and pass to the Inflater.
        //this yields java.util.zip.DataFormatException: incorrect header check

        System.out.print("reading " + compressedSize +  " bytes starting from offset " + offset);
        GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(offset, offset + compressedSize - 1); // S3 ranges are inclusive
        S3Object s3Object = s3.getObject(req);
        byte[] con = IOUtils.toByteArray(s3Object.getObjectContent());
        Inflater inf = new Inflater();
        inf.setInput(con);
        byte[] inflatedContent = new byte[fullSize];
        int sz = inf.inflate(inflatedContent);
        System.out.println("inflated: " + sz);
        // write inflatedContent to file or whatever...
    }

    public static int byteArrayToLeInt(byte[] b) {
        final ByteBuffer bb = ByteBuffer.wrap(b);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        return bb.getInt();
    }

    public static byte[] leIntToByteArray(int i) {
        final ByteBuffer bb = ByteBuffer.allocate(Integer.SIZE / Byte.SIZE);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(i);
        return bb.array();
    }

    protected AmazonS3 getClient() {
        AmazonS3 client = AmazonS3ClientBuilder
            .standard()
            .withRegion("us-east-1")
            .build();
        return client;
    }

    public static void main(String[] args) {
        try {
            new S3ZipTest("alexa-public", "test.zip");
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }

}

EDIT

Comparing the Python calculations with those in my Java code, I realized that Java was coming up 4 short. entry.getExtra().length may report 24, for example, as does the zipinfo command-line utility for the same entry, while Python reports 28. I don't fully understand the discrepancy, but the PKWare spec does mention a "2 byte identifier and a 2 byte data size field" in the extra field. In any case, adding a fudge value of 4 makes it work, but I would like to understand what is actually going on; adding random fudge values to get things working is no solution:

offset += 30 + entry.getName().length() + extra + 4;
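My current theory, based on the PKWare appnote rather than anything verified: the extra field stored with a central directory entry need not be byte-for-byte identical to the one in the local file header, so any local header size derived from ZipFile's central-directory view can come up short. One way to sidestep the fudge entirely would be to fetch each entry's 30-byte local header and read its own name and extra lengths (2-byte little-endian values at offsets 26 and 28). A hypothetical helper along these lines, which would slot into the class above:

    private long findDataStart(String bucket, String key, long localHeaderOffset) throws Exception {
        // fetch just the 30-byte fixed part of the local file header (inclusive range)
        GetObjectRequest req = new GetObjectRequest(bucket, key)
                .withRange(localHeaderOffset, localHeaderOffset + 29);
        byte[] hdr = IOUtils.toByteArray(s3.getObject(req).getObjectContent());
        int nameLen  = (hdr[26] & 0xFF) | ((hdr[27] & 0xFF) << 8); // little-endian shorts
        int extraLen = (hdr[28] & 0xFF) | ((hdr[29] & 0xFF) << 8);
        return localHeaderOffset + 30 + nameLen + extraLen;       // first byte of compressed data
    }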

1 Answer:

Answer 0 (score: 0)

My general approach was sound, but I was hampered by the lack of detail returned by Java's ZipFile. For example, sometimes there is an extra 16 bytes at the end of the compressed data, before the next local header begins (most likely an optional data descriptor: a 4-byte signature plus the CRC-32 and the compressed and uncompressed sizes). ZipFile offers nothing to account for that.
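Also worth noting for anyone following along: the "incorrect header check" from getEntryContent above is most likely the Inflater itself. Zip entries hold raw DEFLATE data with no zlib wrapper, and the no-arg Inflater constructor expects a zlib header; the nowrap variant accepts raw streams. Assuming con holds exactly the entry's compressed bytes, the relevant change inside getEntryContent would be roughly:

        Inflater inf = new Inflater(true); // nowrap=true: zip entries are raw DEFLATE, no zlib header
        inf.setInput(con);
        byte[] inflatedContent = new byte[fullSize];
        int sz = inf.inflate(inflatedContent);
        inf.end(); // release native resources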

zip4j appears to be a better option; it provides the method header.getOffsetLocalHeader(), which eliminates some error-prone calculation.
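A rough sketch of that route (untested; zip4j 2.x coordinates assumed, and it also assumes zip4j can read file headers from the same CD + EOCD temp file built above, since it too locates entries via the EOCD):

import java.util.List;

import net.lingala.zip4j.model.FileHeader;

static void listLocalHeaderOffsets() throws Exception {
    // fully qualified to avoid clashing with java.util.zip.ZipFile used above
    net.lingala.zip4j.ZipFile zip = new net.lingala.zip4j.ZipFile("temp.zip");
    List<FileHeader> headers = zip.getFileHeaders();
    for (FileHeader header : headers) {
        long localHeaderOffset = header.getOffsetLocalHeader();
        // ranged GET from localHeaderOffset, parse the local header's own
        // name/extra lengths (they can differ from the CD copies) to find
        // the compressed data, then inflate with new Inflater(true)
        System.out.println(header.getFileName() + " local header @ " + localHeaderOffset);
    }
}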