How to generate Parquet files using pure Java (including date and decimal types) and upload them to S3 [Windows] (no HDFS)

Date: 2017-11-17 16:21:18

Tags: java apache-spark amazon-s3 avro parquet

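I recently had a requirement to generate Parquet files that Apache Spark can read, using only Java (with no other software installed, such as Apache Drill, Hive, or Spark). The files needed to be saved to S3, so I will share the details of how to do both.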


I am using NetBeans as my IDE.


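  • To serialize data into Parquet, you must pick one of the popular Java data serialization frameworks: Avro, Protocol Buffers, or Thrift. I will use Avro (1.8.0), as can be seen from the parquet-avro dependency.
  • You need an IDE that supports Maven, because the dependencies above pull in many dependencies of their own. Maven downloads them for you automatically (much like NuGet for Visual Studio).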



On Windows, two Hadoop native files are required:

  • hadoop.dll
  • winutils.exe

Both files can be downloaded. This example requires version 2.8.1 (because of parquet-avro 1.9.0).

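  1. Copy these files to C:\hadoop-2.8.1\bin on the target machine.
  2. Add a new system variable (not a user variable) named HADOOP_HOME with the value C:\hadoop-2.8.1
  3. Edit the system Path variable (not the user variable) and append the following to the end: %HADOOP_HOME%\bin
  4. Restart the machine for the changes to take effect.

If this configuration is not done correctly, you will get the following error at runtime: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z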
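
Before running the writer, it may help to confirm that the setup steps above actually took effect. The following is a minimal sketch, not part of the original answer: the class name `HadoopHomeCheck` and the helper `missingHadoopFiles` are hypothetical, and the check only looks for the two files under `%HADOOP_HOME%\bin` without verifying their version.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class HadoopHomeCheck {

    // Hypothetical helper: lists what is missing from the given Hadoop home.
    // Returns ["HADOOP_HOME"] when the variable itself is unset or empty.
    static List<String> missingHadoopFiles(String hadoopHome) {
        List<String> missing = new ArrayList<>();
        if (hadoopHome == null || hadoopHome.isEmpty()) {
            missing.add("HADOOP_HOME");
            return missing;
        }
        File bin = new File(hadoopHome, "bin");
        for (String name : new String[] {"hadoop.dll", "winutils.exe"}) {
            if (!new File(bin, name).isFile()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingHadoopFiles(System.getenv("HADOOP_HOME"));
        if (missing.isEmpty()) {
            System.out.println("Hadoop native files found.");
        } else {
            System.out.println("Missing: " + missing);
        }
    }
}
```

Running this once after the restart in step 4 gives a clearer message than the UnsatisfiedLinkError that otherwise only surfaces when the writer is built.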


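    • First, create a new empty Maven project and add parquet-avro 1.9.0 and hadoop-aws 2.8.2 as dependencies.
    • Create a main class to hold your code.
    • Next, generate the schema. As far as I know, there is no way to build a schema programmatically at runtime: the parse() method of the Schema.Parser class accepts only a file or a string, and the schema cannot be modified once it has been created. To work around this, I generate the schema JSON at runtime and parse it. An example schema appears in the Main class below.
    • The Parquet file written by the Main class listed below loads into Apache Spark (2.2.0).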
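
The "generate the schema JSON at runtime" workaround mentioned above could be sketched roughly as follows. This is an illustration rather than the author's code: `SchemaJsonBuilder` and `buildSchemaJson` are hypothetical names, the field types are passed in as ready-made Avro type JSON, and the names are assumed to be plain identifiers that need no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaJsonBuilder {

    // Hypothetical helper: assembles an Avro record schema as JSON text.
    // Keys of 'fields' are field names; values are Avro type JSON fragments.
    static String buildSchemaJson(String namespace, String recordName, Map<String, String> fields) {
        StringBuilder json = new StringBuilder();
        json.append("{\"namespace\": \"").append(namespace).append("\", ")
            .append("\"type\": \"record\", ")
            .append("\"name\": \"").append(recordName).append("\", ")
            .append("\"fields\": [");
        boolean first = true;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (!first) {
                json.append(", ");
            }
            json.append("{\"name\": \"").append(field.getKey())
                .append("\", \"type\": ").append(field.getValue()).append("}");
            first = false;
        }
        return json.append("]}").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("myInteger", "\"int\"");
        fields.put("myString", "[\"string\", \"null\"]");
        fields.put("myDate", "[{\"type\": \"int\", \"logicalType\": \"date\"}, \"null\"]");
        System.out.println(buildSchemaJson("org.myorganization.mynamespace", "myrecordname", fields));
    }
}
```

The resulting string can be handed to Schema.Parser.parse() in the same way as the hard-coded schema in Main. Avro's SchemaBuilder class may also be worth a look as an alternative to assembling strings by hand.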


    package com.mycompany.stackoverflow;
    import java.math.BigDecimal;
    import java.math.BigInteger;
    import java.math.RoundingMode;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;
    import org.joda.time.DateTime;
    import org.joda.time.DateTimeZone;
    import org.joda.time.Days;
    import org.joda.time.MutableDateTime;
    public class Main {
        public static void main(String[] args) {
            String schema = "{\"namespace\": \"org.myorganization.mynamespace\"," //Not used in Parquet, can put anything
                    + "\"type\": \"record\"," //Must be set as record
                    + "\"name\": \"myrecordname\"," //Not used in Parquet, can put anything
                    + "\"fields\": ["
                    + " {\"name\": \"myInteger\", \"type\": \"int\"}," //Required field
                    + " {\"name\": \"myString\",  \"type\": [\"string\", \"null\"]},"
                    + " {\"name\": \"myDecimal\", \"type\": [{\"type\": \"fixed\", \"size\":16, \"logicalType\": \"decimal\", \"name\": \"mydecimaltype1\", \"precision\": 32, \"scale\": 4}, \"null\"]},"
                    + " {\"name\": \"myDate\", \"type\": [{\"type\": \"int\", \"logicalType\" : \"date\"}, \"null\"]}"
                    + " ]}";
            Schema.Parser parser = new Schema.Parser().setValidate(true);
            Schema avroSchema = parser.parse(schema);
            GenericData.Record record = new GenericData.Record(avroSchema);
            record.put("myInteger", 1);
            record.put("myString", "string value 1");
            BigDecimal myDecimalValue = new BigDecimal("99.9999");
            //First we need to make sure the huge decimal matches our schema scale:
            myDecimalValue = myDecimalValue.setScale(4, RoundingMode.HALF_UP);
            //Next we get the decimal value as one BigInteger (like there was no decimal point)
            BigInteger myUnscaledDecimalValue = myDecimalValue.unscaledValue();
            //Finally we serialize the integer
            byte[] decimalBytes = myUnscaledDecimalValue.toByteArray();
            //We need to create an Avro 'Fixed' type and pass the decimal schema once more here:
            GenericData.Fixed fixed = new GenericData.Fixed(new Schema.Parser().parse("{\"type\": \"fixed\", \"size\":16, \"precision\": 32, \"scale\": 4, \"name\":\"mydecimaltype1\"}"));
            byte[] myDecimalBuffer = new byte[16];
            if (myDecimalBuffer.length >= decimalBytes.length) {
                //Because we set our fixed byte array size as 16 bytes, we need to
                //pad-left our original value's bytes with zeros
                //(zero padding is only correct for non-negative values; a negative
                //value would need 0xFF sign-extension bytes instead)
                int myDecimalBufferIndex = myDecimalBuffer.length - 1;
                for(int i = decimalBytes.length - 1; i >= 0; i--){
                    myDecimalBuffer[myDecimalBufferIndex] = decimalBytes[i];
                    myDecimalBufferIndex--;
                }
                //Save result
                fixed.bytes(myDecimalBuffer);
            } else {
                throw new IllegalArgumentException(String.format("Decimal size: %d was greater than the allowed max: %d", decimalBytes.length, myDecimalBuffer.length));
            }
            //We can finally write our decimal to our record
            record.put("myDecimal", fixed);
            //Get epoch value
            MutableDateTime epoch = new MutableDateTime(0L, DateTimeZone.UTC);
            DateTime currentDate = new DateTime(); //Can take Java Date in constructor
            Days days = Days.daysBetween(epoch, currentDate);
            //We can write number of days since epoch into the record
            record.put("myDate", days.getDays());
            try {
               Configuration conf = new Configuration();
               conf.set("fs.s3a.access.key", "ACCESSKEY");
               conf.set("fs.s3a.secret.key", "SECRETKEY");
               //Below are some other helpful settings
               //conf.set("fs.s3a.endpoint", "s3.amazonaws.com");
               //conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");
               //conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()); // Not needed unless you reference the hadoop-hdfs library.
               //conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName()); // Uncomment if you get "No FileSystem for scheme: file" errors.
               Path path = new Path("s3a://your-bucket-name/examplefolder/data.parquet");
               //Use path below to save to local file system instead
               //Path path = new Path("data.parquet");
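               try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter.<GenericData.Record>builder(path)
                       .withSchema(avroSchema) //The schema parsed above
                       .withCompressionCodec(CompressionCodecName.GZIP) //Codec choice is an assumption; SNAPPY and UNCOMPRESSED also exist
                       .withConf(conf) //Carries the S3 credentials set above
                       .withPageSize(4 * 1024 * 1024) //For compression
                       .withRowGroupSize(16 * 1024 * 1024) //For write buffering (row group size)
                       .build()) {
                   //We only have one record to write in our example
                   writer.write(record);
               }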
            } catch (Exception ex) { 