pySpark: how to access the values inside the tuples of a (key, tuple) RDD (python)

Date: 2017-04-04 14:15:35

Tags: python vector pyspark svm rdd

I am trying to access the values contained in a PipelinedRDD. Here is what I start with:

1. An RDD of (key, code, value):

data = [(11720, (u'I50800', 0.08229813664596274)), (11720, (u'I50801', 0.03076923076923077))]

2. I need it grouped by the first value and turned into (key, tuple) pairs, where tuple = (code, value):

testFeatures = lab_FeatureTuples = labEvents.select('ITEMID', 'SUBJECT_ID', 'NORM_ITEM_CNT') \
    .orderBy('SUBJECT_ID', 'ITEMID') \
    .rdd.map(lambda row: (row.SUBJECT_ID, (row.ITEMID, row.NORM_ITEM_CNT))) \
    .groupByKey()

testFeatures = [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]
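Note that groupByKey returns the grouped values as a pyspark.resultiterable.ResultIterable rather than a plain list, so to materialize the result in the form shown above you can append mapValues(list). A minimal sketch, assuming sc is an active SparkContext and data is the sample list from step 1:

grouped = sc.parallelize(data).groupByKey().mapValues(list)
grouped.take(1)
# [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]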

On the tuple = (code, value) entries, I would like to do the following:

Create a SparseVector out of them so that I can use it for an SVM model:

result.take(1)
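For reference, here is a minimal sketch of one way to turn the grouped (code, value) pairs into SparseVectors: assign each code a fixed column index and build each vector from (index, value) pairs. The code_to_index mapping and to_sparse_vector helper below are hypothetical names, and in practice the mapping would be built from all distinct ITEMID codes:

from pyspark.mllib.linalg import SparseVector

# Hypothetical mapping from each ITEMID code to a column index;
# in practice, build it from the distinct codes in the data.
code_to_index = {u'I50800': 0, u'I50801': 1}
num_features = len(code_to_index)

def to_sparse_vector(code_value_pairs):
    # SparseVector expects (index, value) pairs with ascending indices.
    entries = sorted((code_to_index[code], value) for code, value in code_value_pairs)
    return SparseVector(num_features, entries)

result = testFeatures.mapValues(to_sparse_vector)
result.take(1)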

1 Answer:

Answer 0 (score: 1)

Here is one way to do it:

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

data = [(11720, (u'I50800', 0.08229813664596274)), 
        (11720, (u'I50801', 0.03076923076923077))]
rdd = sc.parallelize(data)

df = sqlc.createDataFrame(rdd, ['idx', 'tuple'])
df.show()

which gives:

+-----+--------------------+
|  idx|               tuple|
+-----+--------------------+
|11720|[I50800,0.0822981...|
|11720|[I50801,0.0307692...|
+-----+--------------------+

Now define PySpark user-defined functions (UDFs) to pull out each element of the tuple:

extract_tuple_0 = sf.udf(lambda x: x[0], returnType=sparktypes.StringType())
extract_tuple_1 = sf.udf(lambda x: x[1], returnType=sparktypes.FloatType())
df = df.withColumn('tup0', extract_tuple_0(sf.col('tuple')))

df = df.withColumn('tup1', extract_tuple_1(sf.col('tuple')))
df.show()

which gives:

+-----+--------------------+------+----------+
|  idx|               tuple|  tup0|      tup1|
+-----+--------------------+------+----------+
|11720|[I50800,0.0822981...|I50800|0.08229814|
|11720|[I50801,0.0307692...|I50801|0.03076923|
+-----+--------------------+------+----------+
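For completeness, the same extraction also works without UDFs by accessing the struct fields of the tuple column directly (a sketch assuming Spark inferred the default struct field names _1 and _2 for the inner tuple):

df2 = df.select('idx',
                sf.col('tuple').getField('_1').alias('tup0'),
                sf.col('tuple').getField('_2').alias('tup1'))

Operating on the columns natively like this avoids the per-row Python call overhead of UDFs.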