Spark Java DataFrame - ClassCastException when reading multiple files into multiple datasets

Time: 2018-03-17 05:25:27

Tags: java apache-spark spark-dataframe apache-spark-dataset

I'm trying to read data from separate files into separate RDDs and then convert them to DataFrames (using the Java API).

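I had no problems when working with a single dataset mapped to a single POJO, but as soon as I tried to read an additional dataset that maps to a different POJO, I started hitting this: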

18/03/17 00:58:28 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassCastException: TestMain$PageView cannot be cast to TestMain$BlacklistedPage
    at org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
    at org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Below is some test code that seems to reproduce the problem (using parallelized data instead of textFile input). I'm using Spark 2.2.1. Am I misusing SparkSession in some way?

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;
import java.util.Arrays;
import java.util.Objects;

public class TestMain {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local").appName("demo").getOrCreate();

        JavaSparkContext sparkContext = JavaSparkContext.fromSparkContext(spark.sparkContext());

        JavaRDD<String> rawBlacklisted = sparkContext.parallelize(Arrays.asList("af .sy", "af 2009"));
        JavaRDD<String> raw = sparkContext.parallelize(Arrays.asList("ab .sy 100", "af 2009 10", "aa title 5"));

        JavaRDD<BlacklistedPage> blackListedPages = rawBlacklisted.map(BlacklistedPage::parse).filter(Objects::nonNull);
        JavaRDD<PageView> rawPageViews = raw.map(PageView::parse).filter(Objects::nonNull);

        Dataset<Row> first = spark.createDataFrame(blackListedPages, BlacklistedPage.class);
        Dataset<Row> second = spark.createDataFrame(rawPageViews, PageView.class);

        first.show(10);
        second.show(10);
    }

    public static class BlacklistedPage implements Serializable {
        private String domainCode;
        private String pageTitle;

        static BlacklistedPage parse(String line) {
            String[] data = line.split(" ");
            if (data.length < 2) {
                return null;
            }
            return new BlacklistedPage(data[0], data[1]);
        }

        BlacklistedPage(String domainCode, String pageTitle) {
            this.domainCode = domainCode;
            this.pageTitle = pageTitle;
        }

        // getters and setters omitted for clarity
    }

    public static class PageView implements Serializable {
        private String domainCode;
        private String pageTitle;
        private Integer viewCount;

        static PageView parse(String line) {
            String[] data = line.split(" ");

            if (data.length < 3) {
                return null;
            }

            return new PageView(data[0], data[1], Integer.parseInt(data[2]));
        }

        PageView(String domainCode, String pageTitle, Integer viewCount) {
            this.domainCode = domainCode;
            this.pageTitle = pageTitle;
            this.viewCount = viewCount;
        }

        // getters and setters omitted for clarity
    }
}
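
For reference, Spark's bean-based `createDataFrame` takes its schema from the bean's properties, so the omitted accessors do matter. Here is a minimal sketch of what they might look like for `PageView`. The accessor bodies are an assumption, since the post leaves them out. The helper `readableProperties` is hypothetical and only uses the JDK's `java.beans.Introspector` to list properties that have a getter; it does not call Spark's own schema inference.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeanPropertiesSketch {

    // Same fields as the PageView class above; accessors are assumed, not taken from the post.
    public static class PageView implements Serializable {
        private String domainCode;
        private String pageTitle;
        private Integer viewCount;

        public String getDomainCode() { return domainCode; }
        public void setDomainCode(String domainCode) { this.domainCode = domainCode; }

        public String getPageTitle() { return pageTitle; }
        public void setPageTitle(String pageTitle) { this.pageTitle = pageTitle; }

        public Integer getViewCount() { return viewCount; }
        public void setViewCount(Integer viewCount) { this.viewCount = viewCount; }
    }

    // Hypothetical helper: names of all properties that expose a getter, sorted alphabetically.
    public static List<String> readableProperties(Class<?> beanClass) {
        try {
            // Stop at Object so the synthetic "class" property is not reported.
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor property : info.getPropertyDescriptors()) {
                if (property.getReadMethod() != null) {
                    names.add(property.getName());
                }
            }
            Collections.sort(names);
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Could not introspect " + beanClass, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readableProperties(PageView.class));
    }
}
```

A field without a public getter would not show up in this list.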

1 Answer:

Answer 0 (score: 0)

Hi, I just modified part of the code, and now it seems to work fine:

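
One change that might produce such a fix, offered as an assumption rather than as the answerer's actual code: both `filter` calls in the question pass the same method reference, `Objects::nonNull`, and the stack trace fails inside `JavaRDD.filter` with a `PageView` being cast to `BlacklistedPage`. That would be consistent with the two serialized method references being confused when the closures are deserialized. Giving each filter its own lambda may avoid this. The sketch below does not use Spark. `SerFilter` is a hypothetical stand-in for Spark's serializable `Function<T, Boolean>`, and `roundTrip` only imitates shipping a closure through Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class LambdaFilterSketch {

    // Hypothetical stand-in for a serializable filter function; not a Spark type.
    public interface SerFilter<T> extends Serializable {
        boolean call(T value);
    }

    public static class BlacklistedPage implements Serializable {
        final String domainCode;
        final String pageTitle;

        BlacklistedPage(String domainCode, String pageTitle) {
            this.domainCode = domainCode;
            this.pageTitle = pageTitle;
        }
    }

    public static class PageView implements Serializable {
        final String domainCode;
        final String pageTitle;
        final Integer viewCount;

        PageView(String domainCode, String pageTitle, Integer viewCount) {
            this.domainCode = domainCode;
            this.pageTitle = pageTitle;
            this.viewCount = viewCount;
        }
    }

    // Serialize and deserialize a filter, loosely imitating a closure sent to an executor.
    @SuppressWarnings("unchecked")
    public static <T> SerFilter<T> roundTrip(SerFilter<T> filter) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(filter);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerFilter<T>) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    // Returns true when both filters still accept their own element type after the round trip.
    public static boolean check() {
        // One explicit lambda per element type instead of a shared Objects::nonNull reference.
        SerFilter<BlacklistedPage> keepBlacklisted = page -> page != null;
        SerFilter<PageView> keepPageView = view -> view != null;

        SerFilter<BlacklistedPage> first = roundTrip(keepBlacklisted);
        SerFilter<PageView> second = roundTrip(keepPageView);

        return first.call(new BlacklistedPage("af", ".sy"))
            && second.call(new PageView("ab", ".sy", 100))
            && !second.call(null);
    }

    public static void main(String[] args) {
        System.out.println(check());
    }
}
```

In the question's code this would correspond to replacing both `.filter(Objects::nonNull)` calls with per-type lambdas such as `.filter(page -> page != null)`. Whether that resolves the exception on a given JDK and Spark version would need to be checked.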