I want to create a DataFrame (a.k.a. Dataset<Row> in Spark 2.1) using createDataFrame(). Everything works fine when I pass a List<Row> as the param, but it throws an exception when I pass a JavaRDD<Row>.
[Code]
SparkSession ss = SparkSession.builder().appName("Spark Test").master("local[4]").getOrCreate();
List<Row> data = Arrays.asList(
    RowFactory.create(Arrays.asList("a", "b", "c")),
    RowFactory.create(Arrays.asList("A", "B", "C"))
);
StructType schema = new StructType(new StructField[]{
    DataTypes.createStructField("col_1", DataTypes.createArrayType(DataTypes.StringType), false)
});
When I try this code, everything works fine:
ss.createDataFrame(data, schema).show();
+---------+
| col_1|
+---------+
|[a, b, c]|
|[A, B, C]|
+---------+
But when I pass a JavaRDD as the first parameter, it throws an exception:
JavaRDD<Row> rdd = JavaSparkContext.fromSparkContext(ss.sparkContext()).parallelize(data);
ss.createDataFrame(rdd, schema).show(); // throws exception
[Exception]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 3, localhost, executor driver): java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>
mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull1, ObjectType(class java.lang.Object), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull1, ObjectType(class java.lang.Object)), StringType), true), validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, col_1), ArrayType(StringType,true))) AS col_1#0
+- mapobjects(MapObjects_loopValue0, MapObjects_loopIsNull1, ObjectType(class java.lang.Object), staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull1, ObjectType(class java.lang.Object)), StringType), true), validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, col_1), ArrayType(StringType,true)))
:- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull1, ObjectType(class java.lang.Object)), StringType), true)
: +- validateexternaltype(lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull1, ObjectType(class java.lang.Object)), StringType)
: +- lambdavariable(MapObjects_loopValue0, MapObjects_loopIsNull1, ObjectType(class java.lang.Object))
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, col_1), ArrayType(StringType,true))
+- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, col_1)
+- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
+- input[0, org.apache.spark.sql.Row, true]
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:547)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:547)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
... 17 more
Any help would be greatly appreciated.
Answer 0 (score: 0)
If this problem is caused by Spark being unable to cast an ArrayList to the String[] type, I would change the rows to produce String[] values instead. Try the following code:
// Note: this uses java.util.stream.Collectors, which must be imported.
List<Row> data = Arrays.asList(
        Arrays.asList("a", "b", "c"),
        Arrays.asList("A", "B", "C")
    ).stream().map(r -> {
        // Convert each List<String> into a String[] so the row field is a Java array
        String[] arr = r.toArray(new String[r.size()]);
        // Wrap in Object[] so create() treats the array as one column value, not as varargs
        return RowFactory.create(new Object[]{arr});
    }).collect(Collectors.toList());
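For completeness, here is a minimal end-to-end sketch assembling the pieces above (assuming Spark 2.1; the class name ArrayColumnExample is just for illustration). It shows the converted rows going through the JavaRDD path without hitting the encoder error:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ArrayColumnExample {
    public static void main(String[] args) {
        SparkSession ss = SparkSession.builder()
                .appName("Spark Test").master("local[4]").getOrCreate();

        // Same schema as in the question: one non-nullable array<string> column
        StructType schema = new StructType(new StructField[]{
                DataTypes.createStructField("col_1",
                        DataTypes.createArrayType(DataTypes.StringType), false)
        });

        // Build rows whose single field is a String[] rather than a java.util.List
        List<Row> data = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("A", "B", "C")
        ).stream()
                .map(r -> RowFactory.create(new Object[]{r.toArray(new String[0])}))
                .collect(Collectors.toList());

        // The RDD path now encodes the rows without the
        // "not a valid external type" failure
        JavaRDD<Row> rdd = JavaSparkContext.fromSparkContext(ss.sparkContext())
                .parallelize(data);
        ss.createDataFrame(rdd, schema).show();

        ss.stop();
    }
}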