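I am using AWS S3 as backup storage for data coming into our Spark cluster. Data arrives every second and is processed once 10 seconds of data have been read. The RDD containing the 10 seconds of data is stored to S3 using
rdd.saveAsObjectFile(s3URL + dateFormat.format(new Date()));
This means that a lot of files are added to S3 every day, under paths of the form
S3URL/2017/07/23/12/00/10, S3URL/2017/07/23/12/00/20, etc.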
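For illustration, paths of that shape could be produced as sketched below. The pattern "yyyy/MM/dd/HH/mm/ss", the UTC time zone, the helper name batchPath and the example base URL are all assumptions; the post does not show how dateFormat is actually configured.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

final class BatchPathExample {
    // Hypothetical helper: builds the per-batch path from a base URL and a batch time.
    // The date pattern and time zone are assumptions, not taken from the original code.
    static String batchPath(String s3Url, Date batchTime) {
        SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy/MM/dd/HH/mm/ss");
        dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        // Mirrors s3URL + dateFormat.format(new Date()); s3Url is assumed to end with "/".
        return s3Url + dateFormat.format(batchTime);
    }

    public static void main(String[] args) {
        // Placeholder bucket and prefix, for demonstration only.
        System.out.println(batchPath("s3://bucketname/logs/", new Date()));
    }
}
```

With a pattern like this, every 10-second batch lands in its own directory, which is why the number of objects grows so quickly over a day.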
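From here it is easy to restore the RDD, which is a JavaRDD<byte[]>, using either sc.objectFile or the AmazonS3 API.
The problem is that, to reduce the number of files we have to iterate over, we run a daily cron job that goes through each file of the day, bunches the data together and stores the new RDD to S3. This is done as follows: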
List<byte[]> dataList = new ArrayList<>(); // A list of all read messages

/* Get all messages from S3 and store them in the above list */
try {
    final ListObjectsV2Request req = new ListObjectsV2Request()
            .withBucketName("bucketname")
            .withPrefix("logs/" + dateString);
    ListObjectsV2Result result;
    do {
        result = s3Client.listObjectsV2(req);
        for (S3ObjectSummary objectSummary : result.getObjectSummaries()) {
            System.out.println(" - " + objectSummary.getKey()
                    + " (size = " + objectSummary.getSize() + ")");
            // The messages are stored in files named "part-00000"
            if (objectSummary.getKey().contains("part-00000")) {
                // try-with-resources closes the object and its stream even if reading fails
                try (S3Object object = s3Client.getObject(
                             new GetObjectRequest(objectSummary.getBucketName(), objectSummary.getKey()));
                     InputStream objectData = object.getObjectContent()) {
                    // A single read() may return fewer bytes than requested,
                    // so read until the end of the stream instead
                    byte[] byteData = com.amazonaws.util.IOUtils.toByteArray(objectData);
                    dataList.add(byteData); // Add the message to the list
                }
            }
        }
        /* Listings are paginated. A truncated result carries a continuation token
         * that has to be passed to the next request in order to see all objects. */
        System.out.println("Next Continuation Token : " + result.getNextContinuationToken());
        req.setContinuationToken(result.getNextContinuationToken());
    } while (result.isTruncated());
} catch (AmazonServiceException ase) {
    System.out.println("Caught an AmazonServiceException, which means your request made it "
            + "to Amazon S3, but was rejected with an error response for some reason.");
    System.out.println("Error Message: " + ase.getMessage());
    System.out.println("HTTP Status Code: " + ase.getStatusCode());
    System.out.println("AWS Error Code: " + ase.getErrorCode());
    System.out.println("Error Type: " + ase.getErrorType());
    System.out.println("Request ID: " + ase.getRequestId());
} catch (AmazonClientException ace) {
    System.out.println("Caught an AmazonClientException, which means the client encountered "
            + "an internal error while trying to communicate with S3, "
            + "such as not being able to access the network.");
    System.out.println("Error Message: " + ace.getMessage());
} catch (IOException e) {
    e.printStackTrace();
}

JavaRDD<byte[]> messages = sc.parallelize(dataList); // Loads the messages into an RDD
messages.saveAsObjectFile("S3URL/daily_logs/" + dateString);
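One detail about copying stream content into a byte[]: a single InputStream.read(byte[]) call is not guaranteed to fill the buffer, so for larger objects the array can end up only partly populated. A minimal stdlib-only sketch of reading a stream to its end follows; the helper name readFully is hypothetical and not part of the AWS SDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadFullyExample {
    // Hypothetical helper: keeps reading until read() signals end of stream with -1.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // copy only the bytes that were actually read
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the S3 object content here.
        byte[] data = readFully(new ByteArrayInputStream(new byte[20000]));
        System.out.println(data.length);
    }
}
```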
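This all works fine, but now I am not sure how to actually restore the data to a manageable state again. If I use sc.objectFile to restore the RDD, I end up with a JavaRDD<byte[]> where each byte[] is actually a JavaRDD<byte[]> in itself. How can I restore the nested JavaRDD from the byte[] located in the JavaRDD<byte[]>?
I hope this makes sense, and I am grateful for any help. In the worst case I will have to come up with another way to back up the data.
Best regards,
Mathias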
Answer 0 (score: 0)
I solved it. Instead of storing a nested RDD, I flat-mapped all the byte[] messages into a single JavaRDD and stored that one.
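The answer does not include code, so the following is only a Spark-free sketch of the flattening idea, using java.util.stream on in-memory lists. The class and method names are hypothetical. In Spark the corresponding step would presumably be a flatMap on the RDD, so that the stored RDD holds individual messages rather than one serialized batch per element.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class FlattenBatchesExample {
    // Hypothetical helper: turns a list of per-batch message lists into one flat list.
    static List<byte[]> flatten(List<List<byte[]>> batches) {
        return batches.stream()
                .flatMap(List::stream) // one stream of messages instead of a stream of batches
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two made-up 10-second batches holding three messages in total.
        List<List<byte[]>> batches = Arrays.asList(
                Arrays.asList("m1".getBytes(StandardCharsets.UTF_8),
                              "m2".getBytes(StandardCharsets.UTF_8)),
                Collections.singletonList("m3".getBytes(StandardCharsets.UTF_8)));
        System.out.println(flatten(batches).size());
    }
}
```

Because every element of the flattened collection is a single message, reading the daily file back gives the same element type as reading one of the original 10-second batches.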