Question

I'm trying to read S3 data from a Java Spark context:

jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true");
jsc.textFile(filePath);

This worked when the hour folders contained only files:

s3://<year>/<month>/<day>/<hour>/<files>
jsc.textFile("s3://<year>/<month>/<day>");

Now, on S3, an hour folder may also contain a new_folder next to its files:

s3://<year>/<month>/<day>/<hour>/<files>
s3://<year>/<month>/<day>/<hour>/<new_folder>/<files>

The code below ignores the files under new_folder:

jsc.textFile("s3://<year>/<month>/<day>");

I tried several regular expressions, but my method isPathExists always returns false:

jsc.textFile("s3n://<year>/<month>/<day>/*/<regular_expression>");
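
A side note on the pattern above: textFile resolves paths through Hadoop's glob matching, not through regular expressions, so the same string can behave very differently under the two. The sketch below is only an illustration. It uses java.nio's glob matcher as a stand-in for Hadoop's globbing (an assumption, the two are not identical), and the date-style paths are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class GlobVsRegex {
    // Treats the pattern as a glob, where * matches any name within one path segment.
    // java.nio is used here as a stand-in for Hadoop's globbing (assumption).
    static boolean matchesAsGlob(String pattern, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        return matcher.matches(Paths.get(path));
    }

    // Treats the same string as a regular expression, where * repeats the previous character.
    static boolean matchesAsRegex(String pattern, String path) {
        return Pattern.matches(pattern, path);
    }

    public static void main(String[] args) {
        String pattern = "2020/01/15/*/*";        // hypothetical <year>/<month>/<day>/*/*
        String file = "2020/01/15/10/part-0000";  // hypothetical file in an hour folder

        System.out.println(matchesAsGlob(pattern, file));
        System.out.println(matchesAsRegex(pattern, file));
    }
}
```

On a Unix-like system the glob reading accepts the hour-level file while the regex reading rejects it, which is why writing a regular expression into the path may match nothing.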

I check the S3 path with the following method, which returns false:

// Returns true if at least one object key in the bucket starts with folderPath.
private static boolean isPathExists(String folderPath, String bucket, String accessKey, String secret) {
    AWSCredentials cred = new BasicAWSCredentials(accessKey, secret);
    AmazonS3 s3 = new AmazonS3Client(cred);
    // withPrefix is compared literally against object keys: wildcards are not expanded.
    ObjectListing objectListing = s3
            .listObjects(new ListObjectsRequest().withBucketName(bucket).withPrefix(folderPath));
    return !objectListing.getObjectSummaries().isEmpty();
}
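
One possible reason this check returns false: the prefix in an S3 listing request is a plain string that object keys must start with, so a folderPath containing a wildcard or a URL scheme will not match anything. The sketch below mimics that literal comparison against an in-memory list. The keys and the helper name anyKeyHasPrefix are hypothetical and no real bucket is contacted.

```java
import java.util.Arrays;
import java.util.List;

public class PrefixCheckDemo {
    // Mimics a literal prefix filter: a key counts only if it starts with the prefix exactly.
    static boolean anyKeyHasPrefix(List<String> keys, String prefix) {
        for (String key : keys) {
            if (key.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical object keys laid out as <year>/<month>/<day>/<hour>/...
        List<String> keys = Arrays.asList(
                "2020/01/15/10/part-0000",
                "2020/01/15/10/new_folder/part-0001");

        // A plain key prefix finds objects.
        System.out.println(anyKeyHasPrefix(keys, "2020/01/15/"));
        // A wildcard is compared character by character, so nothing starts with it.
        System.out.println(anyKeyHasPrefix(keys, "2020/01/15/*/new_folder"));
        // A scheme is not part of any key either.
        System.out.println(anyKeyHasPrefix(keys, "s3n://2020/01/15/"));
    }
}
```

If that is what happens here, passing the plain key prefix (without the scheme, the bucket name, or any `*`) to isPathExists would be worth trying.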

Answer 1

If you want all subdirectories, use two stars.

jsc.textFile("s3://<year>/<month>/<day>/**");

For the files in those directories, add one more star (I think).

jsc.textFile("s3://<year>/<month>/<day>/**/*");
