I have a job that writes many files to HDFS. The files are .gz-compressed, but each file spans 2-3 blocks.
I use the following code to upload the files from HDFS to S3:
FileSystem fileSystem = FileSystem.get(config);
Path srcHdfsDirPath = new Path(sourcePath);
Path[] paths = FileUtil.stat2Paths(fileSystem.listStatus(srcHdfsDirPath, new OutputFilesFilter()));
for (Path srcPath : paths) {
    InputStream in = fileSystem.open(srcPath);
    try {
        uploadFileToAmazon(in, fileSystem.getFileStatus(srcPath).getLen(),
                destinationPath + "/" + srcPath.getName(), s3Client);
    } finally {
        in.close();
    }
}
---------
protected void uploadFileToAmazon(InputStream source, long length, String amazonPath,
        AmazonS3Client s3Client) throws UploadFailed {
    List<PartETag> partETags = new ArrayList<PartETag>();

    InitiateMultipartUploadRequest initRequest =
            new InitiateMultipartUploadRequest(bucketName, amazonPath);
    InitiateMultipartUploadResult initResponse = s3Client.initiateMultipartUpload(initRequest);

    // Step 2: Upload parts.
    long filePosition = 0;
    for (int i = 1; filePosition < length; i++) {
        // Last part can be less than 5 MB. Adjust part size.
        long currentPartSize = Math.min(partSize, (length - filePosition));

        // Create request to upload a part.
        UploadPartRequest uploadRequest = new UploadPartRequest()
                .withBucketName(bucketName).withKey(amazonPath)
                .withUploadId(initResponse.getUploadId()).withPartNumber(i)
                .withFileOffset(filePosition)
                .withInputStream(source)
                .withPartSize(currentPartSize);

        logger.info("upload part {}", i);

        boolean anotherPass;
        int attempts = 0;
        do {
            anotherPass = false; // assume everything is OK
            try {
                // Upload part and add response to our list.
                partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
            } catch (Exception e) {
                anotherPass = true; // repeat
                attempts++;
            }
        } while (anotherPass && attempts < maxLoadAttempts);

        if (attempts == maxLoadAttempts) {
            throw new UploadFailed("failed to upload data to amazon");
        }
        filePosition += partSize;
    }

    logger.info("complete upload request");

    // Step 3: Complete.
    CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
            bucketName, amazonPath, initResponse.getUploadId(), partETags);
    s3Client.completeMultipartUpload(compRequest);
}
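For clarity, the part-size arithmetic in the loop above can be checked in isolation: the last part may be smaller than `partSize`, and the number of parts is the ceiling of `length / partSize`. This is a standalone sketch with illustrative values, not code from my job:

```java
import java.util.ArrayList;
import java.util.List;

// Reproduces the part-size loop from the upload code above, without any
// SDK calls: collects the size of each part for a given object length.
public class PartSizes {
    static List<Long> partSizes(long length, long partSize) {
        List<Long> sizes = new ArrayList<>();
        for (long filePosition = 0; filePosition < length; filePosition += partSize) {
            // Last part can be less than partSize.
            sizes.add(Math.min(partSize, length - filePosition));
        }
        return sizes;
    }

    public static void main(String[] args) {
        // A 12 MB object with 5 MB parts splits into 5 MB + 5 MB + 2 MB.
        System.out.println(partSizes(12L * 1024 * 1024, 5L * 1024 * 1024));
    }
}
```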
While uploading the second part, everything gets stuck on the first file.
A thread dump shows the following:
java.lang.Thread.State: RUNNABLE
at com.amazonaws.services.s3.internal.InputSubstream.read(InputSubstream.java:71)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:73)
at com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream.read(MD5DigestCalculatingInputStream.java:98)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0x00000007cb4bc718> (a com.amazonaws.internal.SdkBufferedInputStream)
at com.amazonaws.internal.SdkBufferedInputStream.read(SdkBufferedInputStream.java:76)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:73)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:73)
at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:98)
at com.amazonaws.http.RepeatableInputStreamRequestEntity.writeTo(RepeatableInputStreamRequestEntity.java:153)
at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197)
at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257)
at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doSendRequest(SdkHttpRequestExecutor.java:47)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:712)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:517)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:635)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:429)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:291)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3655)
at com.amazonaws.services.s3.AmazonS3Client.doUploadPart(AmazonS3Client.java:2770)
at com.amazonaws.services.s3.AmazonS3Client.uploadPart(AmazonS3Client.java:2755)
at com.mycomp.agg.common.aws.AmazonS3Util.uploadFileToAmazon(AmazonS3Util.java:197)
I suspect the root cause is the gz (non-splittable) compression. Is it possible to make this code work with gz-compressed files while still using the low-level API?
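For context, one variant I am considering (a sketch only, not my current code): buffer each part fully into a `byte[]` first, so every part upload would get its own fresh `ByteArrayInputStream` of exactly the part size and no `withFileOffset` is needed on the non-seekable HDFS stream, and a retry can re-send the same bytes. The SDK calls themselves are omitted here; this shows only the stream slicing, and the class and method names are mine:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Slices a (possibly non-seekable) stream into fixed-size parts, buffering
// each part in memory; the last part may be shorter. SDK upload calls are
// intentionally left out.
public class PartSlicer {
    static List<byte[]> readParts(InputStream in, int partSize) throws IOException {
        List<byte[]> parts = new ArrayList<>();
        byte[] buf = new byte[partSize];
        int filled = 0;
        int n;
        while ((n = in.read(buf, filled, partSize - filled)) != -1) {
            filled += n;
            if (filled == partSize) { // a full part is buffered
                parts.add(buf.clone());
                filled = 0;
            }
        }
        if (filled > 0) { // final, shorter part
            byte[] last = new byte[filled];
            System.arraycopy(buf, 0, last, 0, filled);
            parts.add(last);
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[12]);
        for (byte[] part : readParts(in, 5)) {
            // each part would back its own new ByteArrayInputStream(part)
            System.out.println(part.length);
        }
    }
}
```

The trade-off is memory: each in-flight part (5 MB or more, per the S3 multipart minimum) is held on the heap instead of being streamed.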