Convert a Spark Dataset<Row> to a Java POJO class

Time: 2018-11-06 08:58:32

Tags: apache-spark java-8 apache-spark-sql

I am trying to convert a Dataset into Java objects. The schema looks like this:

root
 |-- deptId: long (nullable = true)
 |-- depNameName: string (nullable = true)
 |-- employee: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- phno: array (nullable = true)
 |    |    |    |-- element: integer (containsNull = true)

I created POJO classes like these.

class Department {
  private Long deptId;
  private String depName;
  private List<Employee> employee;
  // with getters, setters and a no-argument constructor
}



class Employee {
  private String firstName;
  private String lastName;
  private List<Long> phno;
  // with getters, setters and a no-argument constructor
}

This is the code I am using for the conversion.

  // parquetFilePath holds the path to the Parquet file
  Dataset<Row> ds = this.spark.read().parquet(parquetFilePath);
  Dataset<Department> departmentDataset = ds.as(Encoders.bean(Department.class));
  JavaRDD<String> rdd = departmentDataset.toJavaRDD().map((Function<Department, String>) v -> {
      StringBuilder sb = new StringBuilder();
      sb.append("deptId").append(v.getDeptId());
      if (!CollectionUtil.isListNullOrEmpty(v.getEmployee())) {
          // only the first employee and their first phone number are written out
          Employee first = v.getEmployee().get(0);
          sb.append("FirstName").append(first.getFirstName());
          if (!CollectionUtil.isListNullOrEmpty(first.getPhno())) {
              sb.append("Ph number").append(first.getPhno().get(0));
          }
      }
      return sb.toString();
  });

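But this code does not work. It fails with org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException. However, I can do the conversion with Row-based constructors, where I have to hard-code the column names. Like this: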

public Department(Row row)
 {
  this.employee = new ArrayList<Employee>();
  this.deptId = (Long) row.getAs("deptId");
  // each element of the "employee" array is itself a Row
  List<Row> rowList = row.getList(row.fieldIndex("employee"));
  if (rowList != null) {
    for (Row r : rowList) {
      Employee obj = new Employee(r);
      employee.add(obj);
    }
  }
 }


 public Employee(Row row)
 {
  this.phno = new ArrayList<Long>();
  this.firstName = (String) row.getAs("firstName");
  // "phno" is an array of plain values, not of nested rows
  List<Long> phnoList = row.getList(row.fieldIndex("phno"));
  if (phnoList != null) {
    phno.addAll(phnoList);
  }
 }

 JavaRDD<Department> departmentRdd = ds.toJavaRDD().map(Department::new);
 JavaRDD<String> rdd = departmentRdd.map((Function<Department, String>) v -> {
     StringBuilder sb = new StringBuilder();
     sb.append("deptId").append(v.getDeptId());
     if (!CollectionUtil.isListNullOrEmpty(v.getEmployee())) {
         // only the first employee and their first phone number are written out
         Employee first = v.getEmployee().get(0);
         sb.append("FirstName").append(first.getFirstName());
         if (!CollectionUtil.isListNullOrEmpty(first.getPhno())) {
             sb.append("Ph number").append(first.getPhno().get(0));
         }
     }
     return sb.toString();
 });

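With this approach I succeeded. But it involves a lot of hard-coded schema names, so I am looking for a more elegant solution.

Please suggest the best solution for this problem.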
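
One thing that can be checked without a cluster is whether the bean property names line up with the schema column names, since Encoders.bean, as far as I understand it, maps columns to JavaBean properties by name. The sketch below uses only java.beans.Introspector and is not Spark code. The class BeanSchemaCheck, its helper methods, and the trimmed Department stand-in (getters named as the mapping code above calls them, element type simplified to String) are assumptions made for illustration. A name mismatch may or may not be related to the ContainerExecutionException.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class BeanSchemaCheck {

    // Hypothetical, trimmed stand-in for the Department bean in the question.
    public static class Department {
        private Long deptId;
        private String depName;
        private List<String> employee; // element type simplified for this sketch

        public Long getDeptId() { return deptId; }
        public void setDeptId(Long deptId) { this.deptId = deptId; }
        public String getDepName() { return depName; }
        public void setDepName(String depName) { this.depName = depName; }
        public List<String> getEmployee() { return employee; }
        public void setEmployee(List<String> employee) { this.employee = employee; }
    }

    // Names of all properties that have both a getter and a setter.
    static Set<String> beanProperties(Class<?> beanClass) {
        try {
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            Set<String> names = new TreeSet<>();
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                if (pd.getReadMethod() != null && pd.getWriteMethod() != null) {
                    names.add(pd.getName());
                }
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Cannot introspect " + beanClass, e);
        }
    }

    // Schema columns for which the bean has no property of the same name.
    static Set<String> unmatchedColumns(Class<?> beanClass, List<String> columns) {
        Set<String> unmatched = new TreeSet<>(columns);
        unmatched.removeAll(beanProperties(beanClass));
        return unmatched;
    }

    public static void main(String[] args) {
        // Top-level column names copied from the printed schema.
        List<String> columns = Arrays.asList("deptId", "depNameName", "employee");
        System.out.println("bean properties: " + beanProperties(Department.class));
        System.out.println("unmatched columns: " + unmatchedColumns(Department.class, columns));
    }
}
```

With the names used here, main prints `bean properties: [depName, deptId, employee]` and `unmatched columns: [depNameName]`, so the depNameName column would have no counterpart in the bean if the printed schema is accurate. In a Spark job the column list could come from `ds.columns()` instead of being typed in by hand.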

0 answers:

No answers