Question

The skeleton of my code is shown below. Basically, I read an RDD from BigQuery and select all data points whose my_field_name value is null.

    // Read the table as Avro records and key each row by the value of my_field_name.
    JavaPairRDD<String, GenericData.Record> input = sc
            .newAPIHadoopRDD(hadoopConfig, AvroBigQueryInputFormat.class, LongWritable.class, GenericData.Record.class)
            .mapToPair(tuple -> {
                GenericData.Record record = tuple._2;
                // Problematic: I want my_field_name of this BigQuery row, but the value returned makes no sense.
                Object rawValue = record.get(my_field_name);
                String partitionValue = rawValue == null ? "EMPTY" : rawValue.toString();
                return new Tuple2<>(partitionValue, record);
            }).cache();
    // Keep only the rows whose my_field_name was null.
    JavaPairRDD<String, GenericData.Record> emptyData =
            input.filter(tuple -> StringUtils.equals("EMPTY", tuple._1));
    emptyData.values().saveAsTextFile(my_file_path);

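But the output RDD is completely unexpected. In particular, the values of my_field_name look completely random. After some debugging, the filtering itself seems to work as expected; the problem is that the value I extract from GenericData.Record (basically record.get(my_field_name)) looks completely random.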

So I switched from AvroBigQueryInputFormat to GsonBigQueryInputFormat to read from BigQuery as JSON instead, and with that change this code seems to work correctly.

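Ideally, though, I would really like to use Avro (which should be much faster than handling JSON), but its current behavior in my code is deeply unsettling. Am I just using AvroBigQueryInputFormat wrong?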
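One thing that may be worth checking, though I have not confirmed it is the cause here: the Spark documentation for newAPIHadoopRDD warns that Hadoop's RecordReader reuses the same object for every record, so caching the returned RDD directly can leave many references to one object. If the Avro reader behaves that way, the key computed in mapToPair would be correct while the cached record is later overwritten by other rows. The sketch below reproduces that effect with plain Java only; the class ReuseDemo, its methods, and the map standing in for a record are hypothetical and are not part of Spark, Avro, or the BigQuery connector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class ReuseDemo {
    // Hypothetical stand-in for a record reader that fills ONE shared mutable
    // record on every call to next(), instead of allocating a new one.
    static Iterator<Map<String, Object>> reusingReader(List<String> values) {
        final Map<String, Object> shared = new HashMap<>();
        final Iterator<String> source = values.iterator();
        return new Iterator<Map<String, Object>>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public Map<String, Object> next() {
                shared.put("my_field_name", source.next());
                return shared; // same instance every time
            }
        };
    }

    // Stores the references handed out by the reader: every entry ends up
    // pointing at the same record, which holds the last row read.
    static List<Object> cacheReferences(List<String> values) {
        List<Map<String, Object>> cached = new ArrayList<>();
        Iterator<Map<String, Object>> reader = reusingReader(values);
        while (reader.hasNext()) {
            cached.add(reader.next());
        }
        return fieldValues(cached);
    }

    // Copies each record before storing it, so later reads cannot overwrite it.
    static List<Object> cacheCopies(List<String> values) {
        List<Map<String, Object>> cached = new ArrayList<>();
        Iterator<Map<String, Object>> reader = reusingReader(values);
        while (reader.hasNext()) {
            cached.add(new HashMap<>(reader.next()));
        }
        return fieldValues(cached);
    }

    private static List<Object> fieldValues(List<Map<String, Object>> records) {
        List<Object> out = new ArrayList<>();
        for (Map<String, Object> record : records) {
            out.add(record.get("my_field_name"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", null, "c");
        System.out.println(cacheReferences(rows)); // prints [c, c, c]
        System.out.println(cacheCopies(rows));     // prints [a, null, c]
    }
}
```

If this turns out to be what is happening, the analogous change in the Spark code would be to copy the record inside mapToPair before calling cache(), for example with Avro's GenericData.get().deepCopy(record.getSchema(), record). I have not verified that this resolves the behavior with AvroBigQueryInputFormat.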

Reading a bq table via AvroBigQueryInputFormat from Spark gives unexpected behavior (using Java)

0 answers: