Question

I am trying to add an index column to a Dataset by converting it to a JavaPairRDD with the code below.

// ds is a Dataset<Row>
JavaPairRDD<Row, Long> indexedRDD = ds.toJavaRDD()
    .zipWithIndex();

// Now I am converting JavaPairRDD to JavaRDD as below.
JavaRDD<Row> rowRDD = indexedRDD
    .map(tuple -> RowFactory.create(tuple._1(),tuple._2().intValue()));

// I am converting the RDD back to a DataFrame, and it doesn't work.
Dataset<Row> authDf = session
    .createDataFrame(rowRDD, ds.schema().add("ID", DataTypes.IntegerType));

// Below is the ds schema(Before adding the ID column).
ds.schema()

root
 |-- user: short (nullable = true)
 |-- score: long (nullable = true)
 |-- programType: string (nullable = true)
 |-- source: string (nullable = true)
 |-- item: string (nullable = true)
 |-- playType: string (nullable = true)
 |-- userf: integer (nullable = true)

The code above throws the following error message:

**Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 
times, most  recent failure: Lost task 0.3 in stage 21.0 (TID 658, 
sl73caehdn0406.visa.com, executor 1):

java.lang.RuntimeException: 
Error while encoding: java.lang.RuntimeException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema is not 
a valid external type for schema of smallint**

Answer 1

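The tuple you create in the second statement has two elements: the first is an object (the entire original Row, holding all columns of the initial dataset) and the second is the index. Calling RowFactory.create(tuple._1(), tuple._2().intValue()) therefore builds a row with only two values. The first value, a GenericRowWithSchema object, lands in the first result column, user, whose type is short (smallint), and that is what causes the error. The second value, the integer index, lands in the second result column, score, whose type is long.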

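You should call RowFactory.create() with one argument per result column: the seven original column values, followed by the index for the new ID column.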
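The flattening step can be sketched as follows. This is a minimal illustration that is deliberately kept free of Spark dependencies so it runs on its own; the class name RowIndexing and the method withIndex are hypothetical names, not part of Spark. It assumes the index fits in an int, since the ID column is declared as IntegerType.

```java
import java.util.Arrays;

// Hypothetical helper (not a Spark API): appends the index to a row's field values.
final class RowIndexing {
    private RowIndexing() {}

    // Returns a copy of `fields` with `index` appended as the last element.
    // The index is narrowed to Integer to match an IntegerType column;
    // Math.toIntExact throws ArithmeticException instead of silently truncating.
    static Object[] withIndex(Object[] fields, long index) {
        Object[] values = Arrays.copyOf(fields, fields.length + 1);
        values[fields.length] = Math.toIntExact(index);
        return values;
    }
}

// Assumed wiring inside the Spark job from the question (not compiled here):
//
//   JavaRDD<Row> rowRDD = indexedRDD.map(tuple -> {
//       Row r = tuple._1();
//       Object[] fields = new Object[r.size()];
//       for (int i = 0; i < r.size(); i++) {
//           fields[i] = r.get(i);               // copy each original column value
//       }
//       return RowFactory.create(RowIndexing.withIndex(fields, tuple._2()));
//   });
```

With the row built this way, each value lines up with one field of ds.schema().add("ID", DataTypes.IntegerType), so the encoder no longer sees a nested Row where it expects a short.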

RuntimeException when converting Dataset<Row> to JavaRDD<Row> and then to DataFrame
