Alternative to creating a DataFrame inside foreach

Time: 2019-02-14 06:32:13

Tags: apache-spark hive rdd

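I have a JavaPairRDD that I need to iterate over, perform some operations on, and store the output to Hive. Currently I am trying to create a DataFrame inside foreach, which throws an exception because a DataFrame cannot be created inside foreach. What is the alternative?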

        JavaPairRDD<Long, Iterable<EmployeeDetail>> employeeDetailPairList = fetchEmployeeDetailData();
        // Small lookup list held on the driver and captured by the closure below
        List<EmployeeZone> employeeZoneList = fetchEmployeeZoneData();
        employeeDetailPairList.foreach(employeeDetailPair -> {
            Iterable<EmployeeDetail> employeeDetailList = employeeDetailPair._2;
            // Collect the distinct zip codes for this key
            Set<String> zipCodeSet = StreamSupport.stream(employeeDetailList.spliterator(), false).map(e -> e.getZipCode()).collect(Collectors.toSet());
            // Keep only the zones whose location matches one of those zip codes
            List<EmployeeZone> employeeZoneFilteredList = employeeZoneList.stream().filter(e -> zipCodeSet.contains(String.valueOf(e.getLoc()))).collect(Collectors.toList());
            List<Output> outputListList = processEmployeeData(employeeZoneFilteredList);
            outputListList = addWeekStartDay(outputListList, weekStartDay);
            if (outputListList != null && this.getSession() != null) {
              // This is the failing part: the closure runs on the executors, where a DataFrame cannot be created
              Dataset<Row> recordsDF = this.getSession().sqlContext().createDataFrame(outputListList, Output.class);
              recordsDF.write().insertInto(SHIPCODE_PREFERRED_FC_HIVE_TABLE);
            }
        });

1 Answer:

Answer 0 (score: 0)

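You cannot create a DataFrame inside a transformation, or inside an action such as foreach: that code runs on the executors, while the SparkSession is only usable on the driver. The way to achieve this is to join the Hive table with the RDD. That avoids the per-record lookup and still performs the required operation. Hope this answers your question.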
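To make the executor/driver split more concrete, here is a minimal sketch of the per-group step from the question pulled out into a plain function that touches no SparkSession. This is only one possible restructuring and not necessarily the join the answer has in mind. `ZoneFilterSketch`, `zonesForGroup`, and the two small data classes are hypothetical stand-ins; only `getZipCode()` and `getLoc()` come from the question, and `getLoc()` is assumed to return an `int`.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for the question's classes; only the getters used there are modeled.
final class ZoneFilterSketch {

    static final class EmployeeDetail implements Serializable {
        private final String zipCode;
        EmployeeDetail(String zipCode) { this.zipCode = zipCode; }
        String getZipCode() { return zipCode; }
    }

    static final class EmployeeZone implements Serializable {
        private final int loc; // assumed numeric, since the question calls String.valueOf(e.getLoc())
        EmployeeZone(int loc) { this.loc = loc; }
        int getLoc() { return loc; }
    }

    // Pure per-group step: no SparkSession or DataFrame involved,
    // so it could be called from a closure that runs on the executors.
    static List<EmployeeZone> zonesForGroup(Iterable<EmployeeDetail> details,
                                            List<EmployeeZone> allZones) {
        // Distinct zip codes for this group
        Set<String> zipCodes = new HashSet<>();
        for (EmployeeDetail d : details) {
            zipCodes.add(d.getZipCode());
        }
        // Zones whose location matches one of those zip codes
        List<EmployeeZone> matched = new ArrayList<>();
        for (EmployeeZone z : allZones) {
            if (zipCodes.contains(String.valueOf(z.getLoc()))) {
                matched.add(z);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<EmployeeDetail> group = Arrays.asList(
                new EmployeeDetail("10001"), new EmployeeDetail("94105"));
        List<EmployeeZone> zones = Arrays.asList(
                new EmployeeZone(10001), new EmployeeZone(60601));
        System.out.println(zonesForGroup(group, zones).size()); // prints 1
    }
}
```

With a function like this, the job could probably be reshaped so that the RDD work only produces records, for example `employeeDetailPairList.flatMap(pair -> buildOutputs(pair._2, zones).iterator())` to get a `JavaRDD<Output>` (Spark 2.x, where `flatMap` expects an `Iterator`), followed by a single `createDataFrame(outputRdd, Output.class)` and `write().insertInto(...)` on the driver. Here `buildOutputs` is a hypothetical wrapper around the filter above plus the question's `processEmployeeData` and `addWeekStartDay`, and `Output` is assumed to be a serializable Java bean.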