Converting a typed JavaRDD to a JavaRDD of Rows

Time: 2016-10-15 20:41:00

Tags: apache-spark dataframe rdd

I am trying to convert a typed RDD into an RDD of Rows and then create a DataFrame from it. Running the code throws an exception.

Code:

JavaRDD<Counter> rdd = sc.parallelize(counters);
JavaRDD<Row> rowRDD = rdd.map((Function<Counter, Row>) RowFactory::create);

//I am using some schema here based on the class Counter
DataFrame df = sqlContext.createDataFrame(rowRDD, getSchema());
df.show(); // throws the exception

Does the conversion from the typed RDD to the Row RDD preserve field order in the row factory? If not, how can I make sure it does?

Class definitions:

class Counter {
  long vid;
  byte[] bytes;
  List<B> blist;
}
class B {
  String id;
  long count;
}

Schema:

private StructType getSchema() {
  List<StructField> fields = new ArrayList<>();
  fields.add(DataTypes.createStructField("vid", DataTypes.LongType, false));
  fields.add(DataTypes.createStructField("bytes", DataTypes.createArrayType(DataTypes.ByteType), false));

  List<StructField> bFields = new ArrayList<>();
  bFields.add(DataTypes.createStructField("id", DataTypes.StringType, false));
  bFields.add(DataTypes.createStructField("count", DataTypes.LongType, false));

  StructType bclasSchema = DataTypes.createStructType(bFields);

  fields.add(DataTypes.createStructField("blist", DataTypes.createArrayType(bclasSchema, false), false));
  StructType schema = DataTypes.createStructType(fields);
  return schema;
}

It fails with the following exception:

java.lang.ClassCastException: test.spark.SampleTest$A cannot be cast to java.lang.Long

    at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:42)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:221)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$LongConverter$.toScalaImpl(CatalystTypeConverters.scala:367)

1 answer:

Answer 0 (score: 4)

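The thing is that no conversion happens here. When you create a Row, it accepts arbitrary Objects and stores them as-is, so RowFactory::create produces a one-field Row whose only value is the Counter itself, and reading that field as the Long the schema declares fails with the ClassCastException. It is therefore not equivalent to DataFrame creation: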

spark.createDataFrame(rdd, Counter.class); 

or Dataset<Counter> creation:

Encoder<Counter> encoder = Encoders.bean(Counter.class);
spark.createDataset(rdd, encoder);

when working with bean classes.
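The "stored as-is" behaviour follows from the varargs signature of the factory method. The sketch below does not use Spark at all: `create` is a hypothetical stand-in that only mimics the `Object...` parameter of `RowFactory.create`, and `Counter` is reduced to a single field for illustration.

```java
import java.util.function.Function;

public class VarargsSketch {
    // Reduced version of the Counter class from the question (assumption: one field is enough here).
    static class Counter {
        long vid;
        Counter(long vid) { this.vid = vid; }
    }

    // Hypothetical stand-in for RowFactory.create(Object... values):
    // it keeps whatever it receives and performs no conversion.
    static Object[] create(Object... values) {
        return values;
    }

    public static void main(String[] args) {
        // Same shape as rdd.map((Function<Counter, Row>) RowFactory::create)
        Function<Counter, Object[]> toRow = VarargsSketch::create;

        Object[] row = toRow.apply(new Counter(42L));

        // The "row" has exactly one field, and that field is the Counter object,
        // not its individual members.
        System.out.println(row.length);
        System.out.println(row[0] instanceof Counter);
    }
}
```

With a three-field schema, the first field is then looked up as a long while the row actually holds a whole object in that position, which matches the kind of ClassCastException shown in the question.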

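So RowFactory::create is not applicable here. If you want to pass an RDD<Row>, all values must already be represented in a form that can be used directly with the type mapping DataFrame requires. This means you have to explicitly map each Counter to a Row of the following shape: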

Row(vid, bytes, List(Row(id1, count1), ..., Row(idN, countN)))

and your code should be equivalent to:

JavaRDD<Row> rows = rdd.map((Function<Counter, Row>) cnt -> {
  return RowFactory.create(
    cnt.vid, cnt.bytes,
    cnt.blist.stream().map(b -> RowFactory.create(b.id, b.count)).toArray()
  );
});

Dataset<Row> df = sqlContext.createDataFrame(rows, getSchema());
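The shape produced by such an explicit mapping can be checked without a cluster. The following is a minimal sketch under stated assumptions: `Object[]` is used as a hypothetical stand-in for `Row`, `row(...)` stands in for `RowFactory.create`, and the constructors on `Counter` and `B` are added only for illustration. Values are listed in the same order as the schema fields (vid, bytes, blist), since Row fields are positional.

```java
import java.util.Arrays;
import java.util.List;

public class RowShapeSketch {
    static class B {
        final String id;
        final long count;
        B(String id, long count) { this.id = id; this.count = count; }
    }

    static class Counter {
        final long vid;
        final byte[] bytes;
        final List<B> blist;
        Counter(long vid, byte[] bytes, List<B> blist) {
            this.vid = vid;
            this.bytes = bytes;
            this.blist = blist;
        }
    }

    // Hypothetical stand-in for RowFactory.create: keeps the values in the order given.
    static Object[] row(Object... values) {
        return values;
    }

    // Mirrors the lambda in the answer: one value per schema field, in schema order,
    // with every B turned into its own nested "row".
    static Object[] toRow(Counter cnt) {
        return row(
            cnt.vid,
            cnt.bytes,
            cnt.blist.stream().map(b -> row(b.id, b.count)).toArray()
        );
    }

    public static void main(String[] args) {
        Counter cnt = new Counter(1L, new byte[] {1, 2},
            Arrays.asList(new B("a", 10L), new B("b", 20L)));

        Object[] r = toRow(cnt);
        Object[] nested = (Object[]) r[2];

        System.out.println(r.length);              // number of top-level fields
        System.out.println(r[0] instanceof Long);  // vid is boxed to Long
        System.out.println(nested.length);         // one nested row per B
    }
}
```

Compared with passing the whole Counter to the factory, each position now holds a value of the type the schema declares for that field.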