I'm trying to extract specific entries from a large zip file in S3 without downloading the entire file.
The Python solution here, Read ZIP files from S3 without downloading the entire file, appears to work. In general, though, the equivalent built-in facilities in Java seem less forgiving, so I've had to make various adjustments.
In the attached code you can see that I've successfully fetched the central directory and written it to a temp file that Java's ZipFile can use to iterate over the zip entries in the CD.
However, I'm stuck inflating an individual entry. The current code throws a bad-header exception. Do I need to give the Inflater the local file header plus the compressed content, or just the compressed content? I've tried both, but clearly I'm either not using the Inflater correctly and/or not giving it what it expects.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.zip.Inflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;
public class S3ZipTest {

    private AmazonS3 s3;

    public S3ZipTest(String bucket, String key) throws Exception {
        s3 = getClient();
        ObjectMetadata metadata = s3.getObjectMetadata(bucket, key);
        runTest(bucket, key, metadata.getContentLength());
    }

    private void runTest(String bucket, String key, long size) throws Exception {
        // fetch the last 22 bytes (end-of-central-directory record; assuming the comment field is empty)
        long start = size - 22;
        GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(start);
        System.out.println("eocd start: " + start);

        // fetch the end of cd record
        S3Object s3Object = s3.getObject(req);
        byte[] eocd = IOUtils.toByteArray(s3Object.getObjectContent());

        // get the start offset and size of the central directory
        int cdSize = byteArrayToLeInt(Arrays.copyOfRange(eocd, 12, 16));
        int cdStart = byteArrayToLeInt(Arrays.copyOfRange(eocd, 16, 20));
        System.out.println("cdStart: " + cdStart);
        System.out.println("cdSize: " + cdSize);

        // get the full central directory
        req = new GetObjectRequest(bucket, key).withRange(cdStart, cdStart + cdSize - 1);
        s3Object = s3.getObject(req);
        byte[] cd = IOUtils.toByteArray(s3Object.getObjectContent());

        // write the full dir + eocd:
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // write cd
        out.write(cd);

        // write eocd, resetting the cd start to 0 since that is
        // where it will appear in our new temp file
        byte[] b = leIntToByteArray(0);
        eocd[16] = b[0];
        eocd[17] = b[1];
        eocd[18] = b[2];
        eocd[19] = b[3];
        out.write(eocd);
        out.flush();
        byte[] cdbytes = out.toByteArray();

        // here we are writing the CD + EOCD to a temp file.
        // ZipFile can read the entries from this file.
        // ZipInputStream and commons compress will not - they seem upset that the data isn't actually there
        File tempFile = new File("temp.zip");
        FileOutputStream output = new FileOutputStream(tempFile);
        output.write(cdbytes);
        output.flush();
        output.close();

        ZipFile zipFile = new ZipFile(tempFile);
        Enumeration<? extends ZipEntry> zipEntries = zipFile.entries();
        long offset = 0;
        while (zipEntries.hasMoreElements()) {
            ZipEntry entry = (ZipEntry) zipEntries.nextElement();
            long fileSize = 0;
            long extra = entry.getExtra() == null ? 0 : entry.getExtra().length;
            offset += 30 + entry.getName().length() + extra;
            if (!entry.isDirectory()) {
                fileSize = entry.getCompressedSize();
                System.out.println(entry.getName() + " offset=" + offset + " size=" + fileSize);
                // not working
                // getEntryContent(bucket, key, offset, fileSize, (int) entry.getSize());
            }
            offset += fileSize;
        }
        zipFile.close();
    }

    private void getEntryContent(String bucket, String key, long offset, long compressedSize, int fullSize) throws Exception {
        // HERE is where things go bad.
        // My guess was that we need to get past the local header for an entry to the actual
        // start of deflated content and then read all the content and pass it to the Inflater.
        // This yields java.util.zip.DataFormatException: incorrect header check
        System.out.print("reading " + compressedSize + " bytes starting from offset " + offset);
        GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(offset, offset + compressedSize);
        S3Object s3Object = s3.getObject(req);
        byte[] con = IOUtils.toByteArray(s3Object.getObjectContent());
        Inflater inf = new Inflater();
        inf.setInput(con);
        byte[] inflatedContent = new byte[fullSize];
        int sz = inf.inflate(inflatedContent);
        System.out.println("inflated: " + sz);
        // write inflatedContent to file or whatever...
    }

    public static int byteArrayToLeInt(byte[] b) {
        final ByteBuffer bb = ByteBuffer.wrap(b);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        return bb.getInt();
    }

    public static byte[] leIntToByteArray(int i) {
        final ByteBuffer bb = ByteBuffer.allocate(Integer.SIZE / Byte.SIZE);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(i);
        return bb.array();
    }

    protected AmazonS3 getClient() {
        AmazonS3 client = AmazonS3ClientBuilder
                .standard()
                .withRegion("us-east-1")
                .build();
        return client;
    }

    public static void main(String[] args) {
        try {
            new S3ZipTest("alexa-public", "test.zip");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
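A note on the exception itself: java.util.zip.Inflater in its default mode expects zlib-wrapped (RFC 1950) input, while zip archives store DEFLATE entries as raw deflate (RFC 1951) data, so even a correctly located entry will fail the header check in that mode. In other words: just the compressed content, but with a "nowrap" Inflater. A minimal helper sketch (the name inflateRaw is made up here; it assumes the input holds exactly one entry's compressed bytes):

// Zip entries compressed with DEFLATE are stored as raw deflate data, with no
// zlib header or trailing checksum, so the Inflater must be opened in nowrap mode.
private static byte[] inflateRaw(byte[] compressed, int uncompressedSize) throws Exception {
    Inflater inf = new Inflater(true); // true = nowrap: raw deflate, as used in zip
    inf.setInput(compressed);
    byte[] out = new byte[uncompressedSize];
    int n = inf.inflate(out);
    inf.end();
    return n == uncompressedSize ? out : Arrays.copyOf(out, n);
}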
EDIT
Comparing the Python calculations to the ones in my Java code, I realized Java was short by 4. entry.getExtra().length may report 24 for a given entry, as does the zipinfo command-line utility for the same entry, while Python reports 28. I don't fully understand the discrepancy, but the PKWARE spec does mention a "2-byte identifier and a 2-byte data size field" per block in the extra field. In any case, adding a fudge value of 4 makes it work, but I'd like to understand what is actually going on; adding random fudge values to make things run is not a fix:
offset += 30 + entry.getName().length() + extra + 4;
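The likely explanation: the extra field recorded in an entry's central directory header does not have to match the extra field in that entry's local file header, and tools routinely write blocks of different sizes in the two places (the extended-timestamp block, for example, typically carries more data in the local copy). ZipEntry.getExtra(), when the entries come from ZipFile, reflects the central directory copy, which is consistent with zipinfo agreeing with Java above. So an offset computed from it can be short by exactly this sort of small delta. A fudge-free alternative is to read each entry's own 30-byte local file header and take the lengths from there. A sketch, assuming the local header offset is already known (findDataStart is a hypothetical helper; the answer below shows how zip4j supplies the offset):

// Reads the fixed 30-byte local file header of one entry via a ranged GET and
// returns the offset at which that entry's compressed data actually starts.
// The name/extra lengths here are the LOCAL ones, which may differ from the
// central directory's copy.
private long findDataStart(String bucket, String key, long localHeaderOffset) throws Exception {
    GetObjectRequest req = new GetObjectRequest(bucket, key)
            .withRange(localHeaderOffset, localHeaderOffset + 29); // ranges are inclusive
    byte[] lfh = IOUtils.toByteArray(s3.getObject(req).getObjectContent());
    // little-endian u16 fields: file name length at offset 26, extra field length at offset 28
    int nameLen = (lfh[26] & 0xff) | ((lfh[27] & 0xff) << 8);
    int extraLen = (lfh[28] & 0xff) | ((lfh[29] & 0xff) << 8);
    return localHeaderOffset + 30 + nameLen + extraLen;
}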
Answer (score: 0):
My general approach was sound, but it was hampered by the lack of detail returned by Java's ZipFile. For example, sometimes there are an extra 16 bytes at the end of the compressed data before the next local header begins (quite possibly an optional data descriptor: a 4-byte signature plus the CRC-32, compressed size, and uncompressed size). ZipFile offers no insight into any of that.
zip4j seems to be a better option; it provides the following method:
header.getOffsetLocalHeader()
which removes some of the error-prone calculation.
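Putting the pieces together, a minimal sketch against zip4j 2.x (net.lingala.zip4j.* package and method names assumed from that version; findDataStart is the hypothetical helper from the EDIT above). The offsets zip4j reports come straight out of the central directory, so they point into the original object on S3 even though the local stub file contains no entry data:

import net.lingala.zip4j.model.FileHeader;
import net.lingala.zip4j.model.enums.CompressionMethod;

// Enumerate entries from the CD+EOCD stub file, then range-request and inflate
// each one directly from S3. net.lingala.zip4j.ZipFile is fully qualified to
// avoid clashing with the java.util.zip.ZipFile imported above.
private void extractEntries(String bucket, String key, File cdStub) throws Exception {
    net.lingala.zip4j.ZipFile cdZip = new net.lingala.zip4j.ZipFile(cdStub);
    for (FileHeader header : cdZip.getFileHeaders()) {
        if (header.isDirectory()) {
            continue;
        }
        long dataStart = findDataStart(bucket, key, header.getOffsetLocalHeader());
        GetObjectRequest req = new GetObjectRequest(bucket, key)
                .withRange(dataStart, dataStart + header.getCompressedSize() - 1);
        byte[] compressed = IOUtils.toByteArray(s3.getObject(req).getObjectContent());
        byte[] content;
        if (header.getCompressionMethod() == CompressionMethod.STORE) {
            content = compressed; // stored entries are written uncompressed
        } else {
            Inflater inf = new Inflater(true); // raw deflate: no zlib wrapper in zip entries
            inf.setInput(compressed);
            content = new byte[(int) header.getUncompressedSize()];
            inf.inflate(content);
            inf.end();
        }
        System.out.println(header.getFileName() + ": " + content.length + " bytes");
    }
}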