I am new to Apache Spark, and we are trying to do some analysis using the MLlib utilities. I put together some code to convert my data into features and then apply a linear regression algorithm, but I am running into problems. If this is a silly question, please forgive me and help.
My person data looks like this (just a simple example to get the code working):

1,1000.00,36
2,2000.00,35
3,2345.50,37
4,3323.00,45
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Person(rating: String, income: Double, age: Int)
val persondata = sc.textFile("D:/spark/mydata/persondata.txt").map(_.split(",")).map(p => Person(p(0), p(1).toDouble, p(2).toInt))
def prepareFeatures(people: Seq[Person]): Seq[org.apache.spark.mllib.linalg.Vector] = {
val maxIncome = people.map(_.income).max
val maxAge = people.map(_.age).max
people.map (p =>
Vectors.dense(
if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
p.income / maxIncome,
p.age.toDouble / maxAge))
}
def prepareFeaturesWithLabels(features: Seq[org.apache.spark.mllib.linalg.Vector]): Seq[LabeledPoint] =
(0d to 1 by (1d / features.length)) zip(features) map(l => LabeledPoint(l._1, l._2))
--- It works up to this point.
--- It breaks in the code below:
scala> val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
<console>:36: error: not found: value people
Error occurred in an application involving default arguments.
val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
^
Please advise.
Answer 0 (score: 1)
You seem to be headed in the right direction, but there are a few small problems. First, you are trying to reference a value (people) that has never been defined; your RDD is named persondata. More generally, you have written code that operates on local sequences (Seq), when you should instead adapt it to work with RDDs (or DataFrames). Also, you appear to be using parallelize in an attempt to parallelize your operations, but parallelize is a helper method that takes a local collection and makes it available as a distributed RDD. I would suggest looking at the programming guide or some of the other documentation to get a better understanding of the Spark API. Good luck on your Spark adventures.
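To make this concrete, here is one way the pipeline could be reworked to stay in RDD operations end to end. This is only a sketch: it inlines the four sample rows with parallelize instead of reading persondata.txt, creates its own local SparkContext (in spark-shell you would use the existing sc), and uses zipWithIndex plus a count to spread the labels evenly over [0, 1], mirroring the (0d to 1 by ...) zip in the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

case class Person(rating: String, income: Double, age: Int)

// Local context for a standalone sketch; in spark-shell, sc already exists.
val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

// Inline copy of the sample rows instead of reading persondata.txt.
val persondata = sc.parallelize(Seq(
  Person("1", 1000.00, 36),
  Person("2", 2000.00, 35),
  Person("3", 2345.50, 37),
  Person("4", 3323.00, 45)))

// Aggregations over an RDD instead of Seq#max.
val maxIncome = persondata.map(_.income).max()
val maxAge    = persondata.map(_.age).max()
val n         = persondata.count()

// Feature vectors and labels, all as RDD transformations.
val data = persondata
  .map(p => Vectors.dense(
    if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
    p.income / maxIncome,
    p.age.toDouble / maxAge))
  .zipWithIndex()  // RDD analogue of zipping with (0d to 1 by ...)
  .map { case (v, i) => LabeledPoint(i.toDouble / (n - 1), v) }
```

Because data is already an RDD[LabeledPoint], there is no need to call parallelize at all; it can be passed straight into an MLlib trainer such as LinearRegressionWithSGD.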