Spark - Restoring an RDD that was saved nested inside another RDD

Time: 2017-07-23 11:35:57

Tags: java apache-spark amazon-s3 rdd

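I use AWS S3 as backup storage for data coming into a Spark cluster. Data arrives every second and is processed once 10 seconds of data have been read. The RDD containing those 10 seconds of data is stored to S3 with:
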
rdd.saveAsObjectFile(s3URL + dateFormat.format(new Date()));

This means that every day we add a large number of files to S3, with paths of the form

S3URL/2017/07/23/12/00/10, S3URL/2017/07/23/12/00/20, and so on.
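
The path layout above can be reproduced with a date pattern. A minimal sketch, assuming `dateFormat` is a `SimpleDateFormat` with the pattern `yyyy/MM/dd/HH/mm/ss` and a UTC time zone (the post shows neither), and using the literal string `S3URL/` as a stand-in for the real bucket URL. `BackupKeys`, `batchPath` and `dayPrefix` are hypothetical names:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class BackupKeys {
    // Assumed pattern; the original post does not show how dateFormat is configured.
    static final String PATTERN = "yyyy/MM/dd/HH/mm/ss";

    /** Builds the path for one 10-second batch under the given base URL. */
    static String batchPath(String s3Url, Date batchTime) {
        SimpleDateFormat dateFormat = new SimpleDateFormat(PATTERN);
        // Fixed zone (an assumption) so the paths do not depend on the host's settings.
        dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        return s3Url + dateFormat.format(batchTime);
    }

    /** Builds the prefix shared by every batch written on one day. */
    static String dayPrefix(String s3Url, Date day) {
        SimpleDateFormat dayFormat = new SimpleDateFormat("yyyy/MM/dd/");
        dayFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        return s3Url + dayFormat.format(day);
    }

    public static void main(String[] args) {
        Date now = new Date();
        System.out.println(batchPath("S3URL/", now));
        System.out.println(dayPrefix("S3URL/", now));
    }
}
```

A day prefix like the one from `dayPrefix` is the kind of value a daily job can use to select all batches of one day.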

From here it is easy to restore the RDD, which is a JavaRDD<byte[]>, using either sc.objectFile or the AmazonS3 API.

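The problem is that, to reduce the number of files we have to iterate over, we run a daily cron job that goes through every file of the day, combines the data, and stores the new RDD to S3. This is done as follows: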

List<byte[]> dataList = new ArrayList<>(); // A list of all read messages
    /* Get all messages from S3 and store them in the above list */
    try {
        final ListObjectsV2Request req = new ListObjectsV2Request().withBucketName("bucketname").withPrefix("logs/" + dateString);
        ListObjectsV2Result result;
        do {               
           result = s3Client.listObjectsV2(req);
           for (S3ObjectSummary objectSummary : 
               result.getObjectSummaries()) {
               System.out.println(" - " + objectSummary.getKey() + "  " +
                       "(size = " + objectSummary.getSize() + 
                       ")");
               if(objectSummary.getKey().contains("part-00000")){ // The messages are stored in files named "part-00000"
                   S3Object object = s3Client.getObject(
                           new GetObjectRequest(objectSummary.getBucketName(), objectSummary.getKey()));
                   // A single read() call may return before the buffer is full,
                   // so let the SDK helper drain the stream completely.
                   try (InputStream objectData = object.getObjectContent()) {
                       byte[] byteData = com.amazonaws.util.IOUtils.toByteArray(objectData); // The size of the messages differ
                       dataList.add(byteData); // Add the message to the list
                   }
               }
           }
           /* Listing results are paginated. The continuation token marks where the
            * next page starts, so keep requesting pages until none remain. */
           System.out.println("Next Continuation Token : " + result.getNextContinuationToken());
           req.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());
     } catch (AmazonServiceException ase) {
        System.out.println("Caught an AmazonServiceException, " +
                "which means your request made it " +
                "to Amazon S3, but was rejected with an error response " +
                "for some reason.");
        System.out.println("Error Message:    " + ase.getMessage());
        System.out.println("HTTP Status Code: " + ase.getStatusCode());
        System.out.println("AWS Error Code:   " + ase.getErrorCode());
        System.out.println("Error Type:       " + ase.getErrorType());
        System.out.println("Request ID:       " + ase.getRequestId());
    } catch (AmazonClientException ace) {
        System.out.println("Caught an AmazonClientException, " +
                "which means the client encountered " +
                "an internal error while trying to communicate" +
                " with S3, " +
                "such as not being able to access the network.");
        System.out.println("Error Message: " + ace.getMessage());
    } catch (IOException e) {
        e.printStackTrace();
    }
    JavaRDD<byte[]> messages = sc.parallelize(dataList); // Loads the messages into an RDD
    messages.saveAsObjectFile("S3URL/daily_logs/" + dateString);
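
One detail worth checking when copying an S3 object's content into a byte[]: a single call to `InputStream.read(byte[])` is not guaranteed to fill the buffer, and network streams often return less than requested. A minimal sketch of a helper that keeps reading until the stream ends, using only `java.io`. `ReadFully`, `readFully` and `TrickleStream` are hypothetical names introduced for illustration; `TrickleStream` merely simulates a stream that delivers a few bytes at a time:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadFully {
    /** Reads the stream to its end, however many read() calls that takes. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) { // -1 signals end of stream
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Hypothetical stream that hands out at most 3 bytes per call, like a slow network. */
    static class TrickleStream extends ByteArrayInputStream {
        TrickleStream(byte[] data) {
            super(data);
        }

        @Override
        public synchronized int read(byte[] b, int off, int len) {
            return super.read(b, off, Math.min(len, 3));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "ten bytes!".getBytes(StandardCharsets.US_ASCII);

        // One read() call: the buffer is large enough, but only part of it gets filled.
        byte[] single = new byte[payload.length];
        int got = new TrickleStream(payload).read(single);

        // The loop in readFully collects everything.
        byte[] all = readFully(new TrickleStream(payload));

        System.out.println("single read: " + got + " bytes, readFully: " + all.length + " bytes");
    }
}
```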

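This all runs fine, but now I don't know how to get the data back into a manageable state. If I use sc.objectFile to restore the RDD, I end up with a JavaRDD<byte[]> in which each byte[] is actually a JavaRDD<byte[]> in itself (that is, the raw contents of one of the original part-00000 files). How can I restore the nested JavaRDD from a byte[] held in the outer JavaRDD<byte[]>?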

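I hope this makes sense, and I would appreciate any help. In the worst case I will have to come up with another way of backing up the data.

Best regards,
Mathias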

1 answer:

Answer 0 (score: 0)

I solved it: instead of storing nested RDDs, I flat-mapped all the byte[] messages into a single JavaRDD and stored that.
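
The difference between the nested and the flat layout can be illustrated without a cluster. The sketch below uses plain Java lists as stand-ins for RDDs (an assumption made so that the example runs anywhere): each inner list plays the role of one 10-second batch, and `flatten` produces the single flat collection the answer describes storing. In Spark, the corresponding step would presumably be to read each batch with `sc.objectFile` and combine the resulting RDDs, for instance with `JavaRDD.union`, before a single `saveAsObjectFile` call. That mapping is an interpretation of the answer, not code taken from it, and `FlattenBatches`, `flatten` and `msg` are hypothetical names:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlattenBatches {
    /** Merges per-batch message lists into one flat list, preserving order. */
    static List<byte[]> flatten(List<List<byte[]>> batches) {
        return batches.stream()
                .flatMap(List::stream) // each batch contributes its individual messages
                .collect(Collectors.toList());
    }

    /** Helper that turns a string into a message payload. */
    static byte[] msg(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two hypothetical 10-second batches standing in for two saved RDDs.
        List<List<byte[]>> batches = new ArrayList<>();
        batches.add(Arrays.asList(msg("a"), msg("b")));
        batches.add(Arrays.asList(msg("c")));

        // The daily collection holds the messages themselves, not the batches.
        List<byte[]> daily = flatten(batches);
        for (byte[] m : daily) {
            System.out.println(new String(m, StandardCharsets.UTF_8));
        }
    }
}
```

Because the daily collection contains the original messages directly, reading it back yields elements of the same type as the 10-second batches, with no second layer to unpack.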