Converting a Spark DataFrame to a List of a bean class

Asked: 2017-10-09 05:12:08

Tags: java apache-spark spark-dataframe

I am reading a CSV file and creating a DataFrame from it. The output I want is a List<MyClass>. My DataFrame looks like this:

+---------------------+-----------------------+---------------------+
|last_value           |processed_on           |notes                |
+---------------------+-----------------------+---------------------+
|2017-01-10 00:10:00.0|2017-10-09 08:32:33.689|2017-01-04,2017-05-09|
|2016-01-20 00:05:00.0|2017-10-09 08:33:18.567|2017-01-10,2017-01-20|
+---------------------+-----------------------+---------------------+

I have a bean class that will hold one row of the output.

import java.io.Serializable;

public class MyClass implements Serializable {
    private String last_value;
    private String proccessed_on;
    private String notes;

    public MyClass() {}

    public void setLastValue(String last_value) {
        this.last_value = last_value;
    }

    public String getLastValue() {
        return last_value;
    }

    public void setProccessedOn(String proccessed_on) {
        this.proccessed_on = proccessed_on;
    }

    public String getProccessed() {
        return proccessed_on;
    }

    public void setNotes(String notes) {
        this.notes = notes;
    }

    public String getNotes() {
        return notes;
    }
}
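Worth noting: Encoders.bean pairs getters and setters into bean properties via java.beans.Introspector, so asymmetric accessor names change what Spark sees. Here is a minimal standalone sketch (plain JDK, no Spark required; the class name InspectBean is mine) that prints the properties the introspector finds for this class:

import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

public class InspectBean {
    public static void main(String[] args) throws IntrospectionException {
        // List every bean property together with its read and write methods.
        BeanInfo info = Introspector.getBeanInfo(MyClass.class, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            System.out.printf("%s: getter=%s, setter=%s%n",
                    pd.getName(), pd.getReadMethod(), pd.getWriteMethod());
        }
    }
}

Run against the class above, it reports proccessed with setter=null and proccessedOn with getter=null: the two accessors do not form a single read/write property.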

I am trying to convert the DataFrame to a List<MyClass>:

SparkSession spark = SparkSession.builder().getOrCreate();

Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("/tmp/somepath");


// setBean is a helper (not shown) that builds a MyClass from the three
// column values; schemaVal presumably maps column names to row indices.
List<MyClass> result = df.map(
        (MapFunction<Row, MyClass>) row -> setBean(
                (String) row.get(schemaVal.get("last_value")),
                (String) row.get(schemaVal.get("processed_on")),
                (String) row.get(schemaVal.get("notes"))),
        Encoders.bean(MyClass.class))
    .collectAsList();
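If the goal is simply a List<MyClass>, the MapFunction is not strictly necessary. A sketch of an alternative, assuming the bean is fixed so that Encoders.bean(MyClass.class) can actually be constructed (see the corrected class after the stack trace) and that its property names are lastValue, processedOn and notes:

import static org.apache.spark.sql.functions.col;

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Rename and cast the columns so they line up with the bean's property
// names, then let Spark map each Row to a bean directly.
Dataset<MyClass> ds = df.select(
                col("last_value").cast("string").as("lastValue"),
                col("processed_on").cast("string").as("processedOn"),
                col("notes").as("notes"))
        .as(Encoders.bean(MyClass.class));

List<MyClass> result = ds.collectAsList();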

But I get this error:

java.lang.NullPointerException
at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465)
at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
at org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(JavaTypeInference.scala:55)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:89)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)

What is wrong here? Should I be using a different approach?
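For what it is worth, this NullPointerException usually points at the bean rather than at the map call: JavaTypeInference walks the bean's property descriptors and invokes each property's read method, and setProccessedOn without a matching getProccessedOn yields a write-only property whose read method is null, which is what fails inside TypeToken.method. A sketch of the bean with paired accessors (the camelCase fields and the corrected "processed" spelling are my choice, not part of the original):

import java.io.Serializable;

public class MyClass implements Serializable {
    private String lastValue;
    private String processedOn;
    private String notes;

    public MyClass() {}

    // Every setter has a getter for the same property name, so
    // Encoders.bean(MyClass.class) can infer a schema without an NPE.
    public String getLastValue() { return lastValue; }
    public void setLastValue(String lastValue) { this.lastValue = lastValue; }

    public String getProcessedOn() { return processedOn; }
    public void setProcessedOn(String processedOn) { this.processedOn = processedOn; }

    public String getNotes() { return notes; }
    public void setNotes(String notes) { this.notes = notes; }
}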

0 Answers:

No answers yet.