overloaded method value corr with alternatives

Time: 2017-02-15 18:23:00

Tags: scala apache-spark

I am trying to compute the correlation between two features, which are read from two separate text files, as shown below.

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.stat.Statistics
import scala.io.Source

object Corr {
     def main() {
            val sparkSession = SparkSession.builder
                .master("local")
                .appName("Correlation")
                .getOrCreate()

            val sc = sparkSession.sparkContext


            val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray
            val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray

            val feature_1_dist = sc.parallelize(feature_1)
            val feature_2_dist = sc.parallelize(feature_2)


            val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
            println(s"Correlation is: $correlation")
      }
} 

Corr.main()

However, I get the following error:

overloaded method value corr with alternatives:
  (x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double],method: String)scala.Double <and>
  (x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double],method: String)scala.Double
 cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[String], String)
        val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")

What I am trying to do looks very similar to the example here, but I can't figure it out.

1 Answer:

Answer 0: (score: 2)

As the error message states, you need RDD[Double], but you have RDD[String]. So, if you have one number per line, you can convert each line to a Double (for example with toDouble) before passing the data to Statistics.corr.
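A minimal sketch of that conversion, assuming each file holds exactly one parseable number per line and both files have the same number of lines. The object name `CorrFixed` and the helpers `parseLines` and `readFeature` are hypothetical names introduced here for illustration; they are not part of Spark or of the original answer.

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.io.Source

object CorrFixed {
  // Hypothetical helper: trim each line, skip blank lines, parse the rest as Double.
  // A line that is not a valid number throws NumberFormatException.
  def parseLines(lines: Seq[String]): Seq[Double] =
    lines.map(_.trim).filter(_.nonEmpty).map(_.toDouble)

  // Hypothetical helper: read the whole file into memory, then close it.
  def readFeature(path: String): Seq[Double] = {
    val source = Source.fromFile(path)
    try parseLines(source.getLines().toList)
    finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("Correlation")
      .getOrCreate()
    val sc = spark.sparkContext

    // Both values are RDD[Double], which matches the Scala overload of Statistics.corr.
    val feature1: RDD[Double] = sc.parallelize(readFeature("feature_1.txt"))
    val feature2: RDD[Double] = sc.parallelize(readFeature("feature_2.txt"))

    val correlation: Double = Statistics.corr(feature1, feature2, "pearson")
    println(s"Correlation is: $correlation")

    spark.stop()
  }
}
```

Parsing on the driver before `sc.parallelize` keeps the change small. Calling `.map(_.toDouble)` on the existing RDD[String] values should work just as well, since either way the arguments reach `Statistics.corr` as RDD[Double]. Note that the two RDDs need the same number of elements and partitions for the correlation to be computed.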