I am trying to access the values contained in a PipelineRDD. Here is where I start:
1. RDD = (key, code, value)
data = [(11720, (u'I50800', 0.08229813664596274)), (11720, (u'I50801', 0.03076923076923077))]
2. I need it grouped by the first value and transformed into (key, tuple), where tuple = (code, value):
testFeatures = lab_FeatureTuples = labEvents.select('ITEMID', 'SUBJECT_ID', 'NORM_ITEM_CNT') \
    .orderBy('SUBJECT_ID', 'ITEMID') \
    .rdd.map(lambda row: (row.SUBJECT_ID, (row.ITEMID, row.NORM_ITEM_CNT))) \
    .groupByKey()
testFeatures = [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]
On the tuple = (code, value) part, I would like to do the following:
create a SparseVector from it, so that I can use it for an SVM model.
result.take(1)
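(Note: groupByKey returns an iterable per key rather than a plain list, so to inspect the grouped output in the form shown above you would need something like the sketch below; result here is assumed to be the grouped RDD from step 2.)

# Materialize each group's iterable into a list for inspection
result = testFeatures.mapValues(list)
result.take(1)
# [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]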
Answer (score: 1):
Here is one way to do it:
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

# Parallelize the (key, (code, value)) pairs and turn them into a
# two-column DataFrame; each tuple becomes a struct column.
data = [(11720, (u'I50800', 0.08229813664596274)),
        (11720, (u'I50801', 0.03076923076923077))]
rdd = sc.parallelize(data)
df = sqlc.createDataFrame(rdd, ['idx', 'tuple'])
df.show()
which gives:
+-----+--------------------+
| idx| tuple|
+-----+--------------------+
|11720|[I50800,0.0822981...|
|11720|[I50801,0.0307692...|
+-----+--------------------+
Now define PySpark user-defined functions to pull each element out of the struct column:
# UDFs that extract the first (string) and second (float) element of the struct
extract_tuple_0 = sf.udf(lambda x: x[0], returnType=sparktypes.StringType())
extract_tuple_1 = sf.udf(lambda x: x[1], returnType=sparktypes.FloatType())
df = df.withColumn('tup0', extract_tuple_0(sf.col('tuple')))
df = df.withColumn('tup1', extract_tuple_1(sf.col('tuple')))
df.show()
which gives:
+-----+--------------------+------+----------+
|  idx|               tuple|  tup0|      tup1|
+-----+--------------------+------+----------+
|11720|[I50800,0.0822981...|I50800|0.08229814|
|11720|[I50801,0.0307692...|I50801|0.03076923|
+-----+--------------------+------+----------+
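To get from here to the stated goal (a SparseVector per key for an SVM model), one option is to stay on the RDD side: index the codes to integer positions and build a pyspark.mllib.linalg.SparseVector from each group's pairs. A minimal sketch, reusing the rdd defined above; the code_index mapping and num_features below are constructed for illustration:

from pyspark.mllib.linalg import SparseVector

# SparseVector needs integer positions, so map each code to an index
# (assumes the distinct set of codes fits in driver memory).
codes = sorted(rdd.map(lambda kv: kv[1][0]).distinct().collect())
code_index = {code: i for i, code in enumerate(codes)}
num_features = len(codes)

# Group the (code, value) pairs by key and turn each group
# into a SparseVector of size num_features.
features = (rdd
            .map(lambda kv: (kv[0], (code_index[kv[1][0]], kv[1][1])))
            .groupByKey()
            .mapValues(lambda pairs: SparseVector(num_features, dict(pairs))))

features.take(1)
# e.g. [(11720, SparseVector(2, {0: 0.0823, 1: 0.0308}))]

The sorted() call just makes the index assignment deterministic across runs; vectors built this way can then be wrapped in LabeledPoints and fed to pyspark.mllib's SVM trainer.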