Spark SQL Correlation.corr() on CSV input throws a NullPointerException

Asked: 2018-12-04 08:17:40

Tags: apache-spark apache-spark-sql

I am new to Spark. I am using Spark 2.4.0 with Java 10.0.2, trying to run a correlation analysis on a CSV input. No row of the CSV contains nulls or blanks. I first build a schema describing the data type of each column in the CSV file, and then I try to run Correlation.corr():

    import org.apache.spark.ml.stat.Correlation;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class myApp {
      public static void main(String[] args) {
        // start the Spark session
        SparkSession spark = SparkSession
            .builder()
            .appName("My App")
            .getOrCreate();

        // schema describing the column data types in the CSV
        StructType customSchema = new StructType(new StructField[] {
            new StructField("Id", DataTypes.IntegerType, true, null),
            new StructField("MSSubClass", DataTypes.StringType, true, null),
            new StructField("MSZoning", DataTypes.StringType, true, null)
        });

        // load the Boston CSV into a Dataset
        Dataset<Row> boston_csv = spark.read()
            .format("csv")
            .option("header", "true")
            .schema(customSchema)
            .load("input_file.csv");

        // correlation between columns?
        Dataset<Row> correlated = Correlation.corr(boston_csv, "MSZoning", "pearson");
        correlated.show();
      }
    }

This results in the following NullPointerException:

                2018-12-04 08:10:49 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
            Exception in thread "main" java.lang.NullPointerException
                    at org.apache.spark.sql.catalyst.expressions.AttributeReference.hashCode(namedExpressions.scala:263)
                    at scala.collection.mutable.FlatHashTable$class.findElemImpl(FlatHashTable.scala:129)
                    at scala.collection.mutable.FlatHashTable$class.containsElem(FlatHashTable.scala:124)
                    at scala.collection.mutable.HashSet.containsElem(HashSet.scala:40)
                    at scala.collection.mutable.HashSet.contains(HashSet.scala:57)
                    at scala.collection.GenSetLike$class.apply(GenSetLike.scala:44)
                    at scala.collection.mutable.AbstractSet.apply(Set.scala:46)
                    at scala.collection.SeqLike$$anonfun$distinct$1.apply(SeqLike.scala:506)
                    at scala.collection.immutable.List.foreach(List.scala:392)
                    at scala.collection.SeqLike$class.distinct(SeqLike.scala:505)
                    at scala.collection.AbstractSeq.distinct(Seq.scala:41)
                    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$unique$1.apply(package.scala:147)
                    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq$$anonfun$unique$1.apply(package.scala:147)
                    at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
                    at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
                    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
                    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
                    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
                    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
                    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
                    at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
                    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
                    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
                    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.unique(package.scala:147)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:897)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:897)
                    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
                    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
                    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:897)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:957)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:957)
                    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
                    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
                    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
                    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
                    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
                    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
                    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
                    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
                    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
                    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
                    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
                    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
                    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
                    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
                    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
                    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:957)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:900)
                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
                    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
                    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:900)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:758)
                    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
                    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
                    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
                    at scala.collection.immutable.List.foldLeft(List.scala:84)
                    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
                    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
                    at scala.collection.immutable.List.foreach(List.scala:392)
                    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
                    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
                    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
                    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
                    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
                    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
                    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
                    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3407)
                    at org.apache.spark.sql.Dataset.select(Dataset.scala:1335)
                    at org.apache.spark.sql.Dataset.select(Dataset.scala:1353)
                    at org.apache.spark.ml.stat.Correlation$.corr(Correlation.scala:70)
                    at org.apache.spark.ml.stat.Correlation.corr(Correlation.scala)
                    at edu.ucr.cs.cs226.groupC.HousingPriceFeatureCorrelation.main(HousingPriceFeatureCorrelation.java:129)
                    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
                    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
                    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
                    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
                    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
                    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
                    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
                    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
                    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
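
For reference, below is a minimal sketch of how I understand ml.stat.Correlation is meant to be called, adapted from the Spark Java correlation example. The file name features_file.csv and the numeric column choices are placeholders of mine, and I added the VectorAssembler step myself to go from CSV columns to a single Vector column, since Correlation.corr appears to operate on a Vector-typed column; note the example passes Metadata.empty() rather than null for the StructField metadata argument:

    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.stat.Correlation;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class CorrSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("CorrSketch").getOrCreate();

        // Metadata.empty() instead of null for the metadata argument,
        // as in the Spark examples
        StructType schema = new StructType(new StructField[] {
            new StructField("Id", DataTypes.IntegerType, true, Metadata.empty()),
            new StructField("MSSubClass", DataTypes.IntegerType, true, Metadata.empty())
        });

        // placeholder input file with two numeric columns
        Dataset<Row> df = spark.read()
            .format("csv")
            .option("header", "true")
            .schema(schema)
            .load("features_file.csv");

        // ml.stat.Correlation.corr takes a single column of type Vector,
        // so the numeric columns are assembled into one first
        Dataset<Row> vectors = new VectorAssembler()
            .setInputCols(new String[] {"Id", "MSSubClass"})
            .setOutputCol("features")
            .transform(df)
            .select("features");

        Dataset<Row> correlated = Correlation.corr(vectors, "features", "pearson");
        correlated.show(false);

        spark.stop();
      }
    }

My code above deviates from this pattern in two places, the null metadata in the schema and the String column passed directly to corr(), but I do not know whether either one explains the exception.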

I am having a hard time figuring out the cause. The top frame, AttributeReference.hashCode, suggests something inside a column attribute is null while Catalyst analyzes the plan, but I cannot tell what. Any suggestions would be greatly appreciated.

0 Answers
