I am trying to read data from separate files into separate RDDs and then convert them to DataFrames (using the Java API). With a single dataset mapped to a single POJO everything works fine, but as soon as I try to read an additional dataset mapped to a different POJO, I start hitting this exception:
18/03/17 00:58:28 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassCastException: TestMain$PageView cannot be cast to TestMain$BlacklistedPage
at org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
at org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Below is some test code that seems to reproduce the problem I'm hitting (using parallelized collections rather than textFile input). I am running Spark 2.2.1. Am I misusing the SparkSession somehow?
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;
import java.util.Arrays;
import java.util.Objects;

public class TestMain {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local").appName("demo").getOrCreate();
        JavaSparkContext sparkContext = JavaSparkContext.fromSparkContext(spark.sparkContext());

        JavaRDD<String> rawBlacklisted = sparkContext.parallelize(Arrays.asList("af .sy", "af 2009"));
        JavaRDD<String> raw = sparkContext.parallelize(Arrays.asList("ab .sy 100", "af 2009 10", "aa title 5"));

        JavaRDD<BlacklistedPage> blackListedPages = rawBlacklisted.map(BlacklistedPage::parse).filter(Objects::nonNull);
        JavaRDD<PageView> rawPageViews = raw.map(PageView::parse).filter(Objects::nonNull);

        Dataset<Row> first = spark.createDataFrame(blackListedPages, BlacklistedPage.class);
        Dataset<Row> second = spark.createDataFrame(rawPageViews, PageView.class);

        first.show(10);
        second.show(10);
    }

    public static class BlacklistedPage implements Serializable {
        private String domainCode;
        private String pageTitle;

        static BlacklistedPage parse(String line) {
            String[] data = line.split(" ");
            if (data.length < 2) {
                return null;
            }
            return new BlacklistedPage(data[0], data[1]);
        }

        BlacklistedPage(String domainCode, String pageTitle) {
            this.domainCode = domainCode;
            this.pageTitle = pageTitle;
        }

        // getters and setters omitted for clarity
    }

    public static class PageView implements Serializable {
        private String domainCode;
        private String pageTitle;
        private Integer viewCount;

        static PageView parse(String line) {
            String[] data = line.split(" ");
            if (data.length < 3) {
                return null;
            }
            return new PageView(data[0], data[1], Integer.parseInt(data[2]));
        }

        PageView(String domainCode, String pageTitle, Integer viewCount) {
            this.domainCode = domainCode;
            this.pageTitle = pageTitle;
            this.viewCount = viewCount;
        }

        // getters and setters omitted for clarity
    }
}
Answer 0 (score: 0)
Hi, I just modified part of the code, and it now seems to work fine.
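A minimal sketch of one change that avoids this class of error, assuming the culprit is the Objects::nonNull method reference being used as two differently typed filter functions: both filters then serialize against the same implementation method, and when the task is deserialized on an executor the PageView filter can resolve to the lambda whose bridge casts its argument to BlacklistedPage, matching the ClassCastException in the trace. Replacing the shared method reference with explicit lambdas gives each filter its own synthetic method (everything else in TestMain left unchanged):

// Sketch under the assumption above: explicit null-check lambdas instead of
// the shared Objects::nonNull method reference, so the two filter functions
// stay distinct when they are deserialized on the executor.
JavaRDD<BlacklistedPage> blackListedPages =
        rawBlacklisted.map(BlacklistedPage::parse).filter(page -> page != null);
JavaRDD<PageView> rawPageViews =
        raw.map(PageView::parse).filter(view -> view != null);

Dataset<Row> first = spark.createDataFrame(blackListedPages, BlacklistedPage.class);
Dataset<Row> second = spark.createDataFrame(rawPageViews, PageView.class);

first.show(10);
second.show(10);

Writing the two filters as anonymous org.apache.spark.api.java.function.Function implementations works for the same reason, if you prefer to avoid lambdas entirely.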