Question

I am trying to create an RDD from a Dataset, but I cannot find a way to map over each row of the Dataset.

    Dataset<POJO> df1 = session.read().parquet(tableName).as(Encoders.bean(POJO.class));

This is the approach I am using:

    JavaRDD<List<Tuple3<Long, Integer, Double>>> tempDatas1 = df1.map(r -> new MapFunction<POJO, List<Tuple3<Long, Integer, Double>>>(){
        //@Override
        public List<Tuple3<Long, Integer, Double>> call(POJO row) throws Exception
        {

        // Get the sample property, remove leading and ending spaces and split it by comma
        // to get each sample individually
        List<Tuple2<String, Integer>> samples = zipWithIndex((row.getSamples().trim().split(",")));

        // Gets the unique identifier for that s.
        Long snp = row.getPos();

        // Calculates the hamming distance.
        return samples.stream().map(t -> {
            String alleles = t._1();
            Integer patient = t._2();

            List<String> values = Arrays.asList(alleles.split("\\|"));

            Double firstAllele = Double.parseDouble(values.get(0));
            Double secondAllele = Double.parseDouble(values.get(1));

            // Returns the initial S id, p id and the distance in form of Tuple.
            return new Tuple3<>(snp, patient, firstAllele + secondAllele);
        }).collect(Collectors.toList());
        }
    });

The `map` in `df1.map(r ->` is flagged with the error `cannot resolve method map(<lambda expression>)`.

Answer 1

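Use `df1.toJavaRDD()` or `df1.rdd()` instead of calling `map` directly on the Dataset. It is better to convert the Dataset to an RDD first, map over that RDD, and keep the output as an RDD. A `map` on a Dataset returns another Dataset, so it cannot produce a `JavaRDD` or `JavaPairRDD` unless the Dataset is converted to an RDD first.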
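To make the suggestion concrete, here is a minimal sketch of the per-row logic from the question, pulled out into a plain static method so it can be run without Spark. The class name `SampleRowMapper` and the `Distance` holder (standing in for `Tuple3<Long, Integer, Double>`) are hypothetical, and the zero-based patient index is an assumption about what the question's `zipWithIndex` helper does, since that helper is not shown.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class SampleRowMapper {

    /** Hypothetical stand-in for Tuple3<Long, Integer, Double>; Serializable so Spark could ship it. */
    static final class Distance implements Serializable {
        public final long snp;
        public final int patient;
        public final double value;

        Distance(long snp, int patient, double value) {
            this.snp = snp;
            this.patient = patient;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + snp + "," + patient + "," + value + ")";
        }
    }

    /**
     * Splits the comma-separated samples of one row, then sums the two
     * '|'-separated alleles of each sample. The patient id is assumed to be
     * the zero-based position of the sample within the row.
     */
    static List<Distance> toDistances(long snp, String samples) {
        String[] parts = samples.trim().split(",");
        List<Distance> result = new ArrayList<>(parts.length);
        for (int patient = 0; patient < parts.length; patient++) {
            String[] alleles = parts[patient].trim().split("\\|");
            double sum = Double.parseDouble(alleles[0]) + Double.parseDouble(alleles[1]);
            result.add(new Distance(snp, patient, sum));
        }
        return result;
    }

    public static void main(String[] args) {
        // One row with two samples; with Spark this method would be called once per Dataset row.
        System.out.println(toDistances(42L, " 0|1,1|1 "));
    }
}
```

With Spark on the classpath, this would presumably be wired in as `df1.toJavaRDD().map(row -> SampleRowMapper.toDistances(row.getPos(), row.getSamples()))`, which passes a plain lambda to `JavaRDD.map` rather than wrapping an anonymous `MapFunction` inside a lambda as the question does.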

Spark Map to Dataset Row
