PySpark: LabeledPoint RDD with many features

Asked: 2015-09-21 03:05:51

Tags: apache-spark pyspark rdd apache-spark-mllib

I'm new to Spark, and all the examples I've read involve small data sets, for example:

from pyspark.mllib.regression import LabeledPoint

RDD = sc.parallelize([
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
])

However, I have a large data set with more than 50 features.

An example row:

u'2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5'

I'd like to quickly create a LabeledPoint RDD in PySpark. I tried indexing the last position as the label of the LabeledPoint, and the first n-1 positions as a dense vector, but I get the error below. Any guidance is appreciated! Note: if I change the [] to () when creating the labeled point, I get an "invalid syntax" error.

df = myDataRDD.map(lambda line: line.split(','))
data = [
    LabeledPoint(df[54], df[0:53])
]
TypeError: 'PipelinedRDD' object does not support indexing
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-67-fa1b56e8441e> in <module>()
      2 df = myDataRDD.map(lambda line: line.split(','))
      3 data = [
----> 4      LabeledPoint(df[54], df[0:53])
      5 ]

TypeError: 'PipelinedRDD' object does not support indexing

2 Answers:

Answer 0 (score: 6)

As the error tells you, you cannot access an RDD by index. You need a second map statement to convert the sequences into LabeledPoints:

from pyspark.mllib.regression import LabeledPoint

rows = [u'2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5', u'2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5']

rows_rdd = sc.parallelize(rows) # create RDD with given rows

labeled_points_rdd = (rows_rdd
                      .map(lambda row: row.split(','))                    # split each row into a sequence of strings
                      .map(lambda seq: LabeledPoint(seq[-1], seq[:-1])))  # build a LabeledPoint per sequence, with the last item as the label

print(labeled_points_rdd.take(2))
# prints [LabeledPoint(5.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,...]),
#         LabeledPoint(5.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,...])]

Note that negative indexing in Python lets you access a sequence from the end, and slicing with `seq[:-1]` gives you everything except the last item.
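As a plain-Python illustration (no Spark needed), here is the same split-and-slice logic applied to a shortened version of the sample row from the question:

```python
# Plain Python, no Spark required: the split-and-slice logic used
# in the map steps above, on a shortened row for readability.
row = u'2596,51,3,258,0,510,221,232,148,6279,1,0,0,5'
seq = row.split(',')

label = seq[-1]      # negative index: the last element
features = seq[:-1]  # slice: everything except the last element

print(label)         # prints: 5
print(features[:3])  # prints: ['2596', '51', '3']
```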

With .take(n), you get the first n elements of the RDD.

Hope this helps.
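The two-step transformation above can also be sketched without a cluster, using plain Python lists in place of RDDs and a (label, features) tuple in place of LabeledPoint (so this sketch does not need pyspark installed; the row data here is made up for illustration):

```python
# Cluster-free sketch of the two map steps: a list stands in for the RDD,
# and a (label, features) tuple stands in for LabeledPoint.
rows = [
    u'2596,51,3,258,0,510,221,232,148,6279,5',  # illustrative rows,
    u'2717,48,7,240,10,480,230,228,150,6100,4', # not real data
]

split_rows = [row.split(',') for row in rows]   # first map: split each row
points = [(float(seq[-1]),                      # second map: last item -> label
           [float(x) for x in seq[:-1]])        #             the rest  -> features
          for seq in split_rows]

print(points[0][0])      # prints: 5.0
print(points[0][1][:3])  # prints: [2596.0, 51.0, 3.0]
```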

Answer 1 (score: 2)

You can't use indexing; instead, you have to use the methods provided by the Spark API. So:

data = [LabeledPoint(myDataRDD.take(RDD.count()),      # last element
                     myDataRDD.top(RDD.count() - 1))]  # all but last

(Untested, but that's the general idea.)