Need some inputs on feature extraction in Apache Spark

Date: 2015-07-02 07:35:53

Tags: apache-spark feature-extraction

I am new to Apache Spark and we are trying to use the MLlib utilities for some analysis. I put together some code that converts my data into features and then applies a linear regression algorithm, and I am running into a problem. Please help, and excuse me if this is a silly question.

My person data looks like this:

1,1000.00,36
2,2000.00,35
3,2345.50,37
4,3323.00,45

This is just a simple example to get the code working.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

case class Person(rating: String, income: Double, age: Int)
val persondata = sc.textFile("D:/spark/mydata/persondata.txt").map(_.split(",")).map(p => Person(p(0), p(1).toDouble, p(2).toInt))

def prepareFeatures(people: Seq[Person]): Seq[org.apache.spark.mllib.linalg.Vector] = {
  val maxIncome = people.map(_.income).max
  val maxAge = people.map(_.age).max

  people.map (p =>
    Vectors.dense(
      if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
      p.income / maxIncome,
      p.age.toDouble / maxAge))
}


def prepareFeaturesWithLabels(features: Seq[org.apache.spark.mllib.linalg.Vector]): Seq[LabeledPoint] =
  (0d to 1 by (1d / features.length)) zip(features) map(l => LabeledPoint(l._1, l._2))

--- It works up to here.
--- It breaks in the code below.

val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))

scala> val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
<console>:36: error: not found: value people
Error occurred in an application involving default arguments.
       val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
                                                                           ^

Please advise.

1 Answer:

Answer 0 (score: 1)

You seem to be headed in the right direction, but there are a few small issues. First, you are trying to reference a value (people) that has not been defined yet. More generally, you are writing your code to work with local sequences, whereas you should instead adapt it to work with RDDs (or DataFrames). You also appear to be calling parallelize in an attempt to parallelize your operation, but parallelize is a helper method that takes a local collection and makes it available as a distributed RDD; since sc.textFile already returns an RDD, it is not needed here. I would suggest reading the programming guide or other documentation to get a better feel for the Spark APIs. Best of luck with your Spark adventures.
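As a rough illustration of what "rework the code to use RDDs" could look like, here is a minimal sketch that keeps the question's file path, column layout, and rating-to-weight mapping, but defines people before it is used and keeps every transformation on the RDD. The zipWithIndex-based labelling is an assumption standing in for the question's (0d to 1 by ...) construction, not necessarily what the answerer had in mind.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    case class Person(rating: String, income: Double, age: Int)

    object PersonFeatures {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PersonFeatures"))

        // sc.textFile already returns a distributed RDD, so no parallelize is needed.
        val people = sc.textFile("D:/spark/mydata/persondata.txt")
          .map(_.split(","))
          .map(p => Person(p(0), p(1).toDouble, p(2).toInt))

        // Scaling constants are computed with RDD actions instead of Seq methods.
        val maxIncome = people.map(_.income).max()
        val maxAge    = people.map(_.age).max()

        // Feature vectors are built as an RDD transformation.
        val features = people.map { p =>
          Vectors.dense(
            if (p.rating == "A") 0.7 else if (p.rating == "B") 0.5 else 0.3,
            p.income / maxIncome,
            p.age.toDouble / maxAge)
        }

        // Attach a label per row; zipWithIndex replaces the (0d to 1 by ...) zip
        // from the question, giving labels that grow from 0 towards 1.
        val count = features.count()
        val data = features.zipWithIndex().map { case (v, i) =>
          LabeledPoint(i.toDouble / count, v)
        }

        data.take(4).foreach(println)
        sc.stop()
      }
    }

The key change is that the data never leaves the RDD: the maxima come from RDD actions, and the features and labels come from RDD transformations, so there is nothing left for parallelize to do. parallelize is only needed when you start from a local Scala collection.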