I am using the setup described in "How to Generate Parquet File Using Pure Java (Including Date & Decimal Types) And Upload to S3 [Windows] (No HDFS)".
public void writeToParquet(List<GenericData.Record> recordsToWrite, String fileToWrite) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.s3.awsAccessKeyId", "<access_key>");
    conf.set("fs.s3.awsSecretAccessKey", "<secret_key>");
    Path path = new Path(fileToWrite); // fileToWrite = "s3://bucket/folder/data.parquet"
    // avroSchema is assumed to be defined elsewhere in the enclosing class.
    try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
            .<GenericData.Record>builder(path)
            .withSchema(avroSchema)
            .withConf(conf)
            .withRowGroupSize(16 * 1024 * 1024) // 16 MB row groups
            .withPageSize(4 * 1024 * 1024)      // 4 MB pages
            .build()) {
        for (GenericData.Record record : recordsToWrite) {
            writer.write(record);
        }
        // No explicit close() needed: try-with-resources closes the writer.
    } catch (Exception ex) {
        ex.printStackTrace();
        LOGGER.info("ParquetWriter Exception " + ex);
    }
}
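I have the same library versions as @Sal mentions above. With a small file of about 5 records, everything converts fine, but I have a large number of records, about 800k (the source file is 5 GB+), that I need to convert to Parquet.
Problem 1: When I write the file to a local drive and upload it explicitly, the output contains barely 10 records and is about 5 MB in size.
Problem 2: When I upload directly to S3 as shown above, I run into a weird issue: on every run after the first I get the exception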
java.io.IOException: File already exists: s3://mybucket/output/folder/path/myfile.parquet
Interestingly, no file exists (or at least none is visible) under that path, yet the error persists.
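One way to sidestep the name collision while debugging is to give each run its own object name. The sketch below is only an illustration under that assumption: `UniqueOutputPath`, `withSuffix`, and `forRun` are hypothetical helpers, not part of Parquet or Hadoop, and they do not explain why the existing-file check fires in the first place.

```java
import java.util.UUID;

// Hypothetical helper (not part of Parquet or Hadoop): derives a per-run
// output path so that a later run never targets an already-used name.
final class UniqueOutputPath {

    private UniqueOutputPath() {}

    /** Inserts "-suffix" before the file extension of basePath, if it has one. */
    static String withSuffix(String basePath, String suffix) {
        int slash = basePath.lastIndexOf('/');
        int dot = basePath.lastIndexOf('.');
        // Treat the dot as an extension separator only when it lies inside
        // the last path segment and is not that segment's first character.
        if (dot > slash + 1) {
            return basePath.substring(0, dot) + "-" + suffix + basePath.substring(dot);
        }
        return basePath + "-" + suffix;
    }

    /** Builds a path that differs on every call by appending a random UUID. */
    static String forRun(String basePath) {
        return withSuffix(basePath, UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        // Prints something like s3://bucket/folder/data-<uuid>.parquet
        System.out.println(forRun("s3://bucket/folder/data.parquet"));
    }
}
```

The resulting string could be passed as the path argument of `writeToParquet`. If the error still appears with a fresh name on every run, the cause is probably not a leftover object.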
Problem 3: I also get this exception:
java.lang.NoSuchFieldError: workaroundNonThreadSafePasswdCalls
at org.apache.hadoop.io.nativeio.NativeIO.initNative(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO.<clinit>(NativeIO.java:89)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:655)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:290)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:385)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:536)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:443)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:244)
at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:273)
at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:494)
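A `NoSuchFieldError` generally means that code built against one version of a class is running against a different one. Because this trace fails inside the native `initNative` call, one guess is that the Hadoop jar and the native library on the machine come from different versions. That is an assumption, not a confirmed diagnosis. The sketch below is a small stdlib-only diagnostic: it reports which jar a class is loaded from and prints the native library search path. `ClassOrigin` and `describe` are hypothetical names introduced for illustration.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class is loaded from,
// which can reveal duplicate or mismatched jars on the classpath.
final class ClassOrigin {

    private ClassOrigin() {}

    static String describe(String className) {
        try {
            // initialize=false loads the class without running its static
            // initializer, so a failing <clinit> is not triggered here.
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "JDK/bootstrap class (no code source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace above.
        System.out.println(describe("org.apache.hadoop.io.nativeio.NativeIO"));
        System.out.println(describe("org.apache.parquet.hadoop.ParquetWriter"));
        // Directories searched for native libraries.
        System.out.println(System.getProperty("java.library.path"));
    }
}
```

Run with the same classpath as the failing job. If the Hadoop classes resolve to a jar from an unexpected version, or the library path points at native binaries from another Hadoop release, that mismatch is worth ruling out first.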
Please help. Thanks in advance.