I am trying to compute the correlation between two features, which are read from two separate text files, as shown below.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.stat.Statistics
import scala.io.Source

object Corr {
  def main() {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("Correlation")
      .getOrCreate()

    val sc = sparkSession.sparkContext

    val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray
    val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray

    val feature_1_dist = sc.parallelize(feature_1)
    val feature_2_dist = sc.parallelize(feature_2)

    val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
    println(s"Correlation is: $correlation")
  }
}

Corr.main()
However, I get the following error:
overloaded method value corr with alternatives:
(x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double],method: String)scala.Double <and>
(x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double],method: String)scala.Double
cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[String], String)
val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
What I am trying to do looks very similar to the example here, but I cannot figure it out.
Answer 0 (score: 2)
As the error message says, Statistics.corr expects an RDD[Double], but you are passing an RDD[String]. So you need to convert the values before (or while) building the RDDs, which you can do as follows (assuming each line contains a single number).
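A minimal sketch of that conversion, reusing the variable names from the question and assuming one numeric value per line in each file:

// Parse each line to a Double so Statistics.corr receives RDD[Double]
val feature_1_dist = sc.parallelize(feature_1.map(_.toDouble))
val feature_2_dist = sc.parallelize(feature_2.map(_.toDouble))

val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")

Equivalently, you could keep the original parallelize calls and map the resulting RDD[String] with .map(_.toDouble); either way the call then matches the RDD[Double] overload listed in the error message.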