Extracting each column from a collect_list into Spark ML label, features format

Time: 2018-02-04 05:08:31

Tags: python-3.x apache-spark pyspark apache-spark-mllib apache-spark-ml

sc.version => u'2.2.0'

I am running into trouble extracting individual data columns from an RDD to build a Spark ML pipeline. I first read in the data, then used the struct function to combine the columns before grouping by column "A". After that I applied collect_list. However, I am stuck converting the result into the Spark ML label, features format, i.e.:

df = spark.createDataFrame(input_data, ["label", "features"])
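For context, here is a minimal sketch of that target shape, using values from the first rows of the table below (J as the label, B through I as features) and assuming a SparkSession named spark:

from pyspark.ml.linalg import Vectors

# Sketch only: label is a single double (J), features is a dense vector of B..I.
input_data = [
    (-999.0, Vectors.dense([-999.0, 90.0, -999.0, -999.0, 264.7466, -999.0, -999.0, -999.0])),
    (-999.0, Vectors.dense([-999.0, 90.5, -999.0, -999.0, 258.5411, -999.0, -999.0, -999.0])),
]
df = spark.createDataFrame(input_data, ["label", "features"])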

Here is the initial table:

import pyspark
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

+---+--------------+------+-----+------+------+--------+------+------+------+--------+
|_c0|             A|     B|    C|     D|     E|       F|     G|     H|     I|       J|
+---+--------------+------+-----+------+------+--------+------+------+------+--------+
|  0|50009100010000|-999.0| 90.0|-999.0|-999.0|264.7466|-999.0|-999.0|-999.0|  -999.0|
|  1|50009100010000|-999.0| 90.5|-999.0|-999.0|258.5411|-999.0|-999.0|-999.0|  -999.0|
|  2|50009100010000|-999.0| 91.0|-999.0|-999.0|252.3356|-999.0|-999.0|-999.0|  -999.0|
|  3|50009100010000|-999.0| 91.5|-999.0|-999.0|246.1301|-999.0|-999.0|-999.0|  -999.0|
|  4|50009100010000|-999.0| 92.0|-999.0|-999.0|239.9246|-999.0|-999.0|-999.0| 39.8812|
+---+--------------+------+-----+------+------+--------+------+------+------+--------+

Combining the columns before grouping the data by A:
df_next = df.select("A", F.struct(["B","C","D","E","F","G","H","I","J"]).alias("allcol"))
df_next.show(5)

+--------------+--------------------+
|             A|              allcol|
+--------------+--------------------+
|50009100010000|[-999.0,90.0,-999...|
|50009100010000|[-999.0,90.5,-999...|
|50009100010000|[-999.0,91.0,-999...|
|50009100010000|[-999.0,91.5,-999...|
|50009100010000|[-999.0,92.0,-999...|
+--------------+--------------------+
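For clarity, allcol is now a single struct column holding the nine original columns, and its individual fields remain addressable by name (the exact field types depend on how the source file was read):

df_next.printSchema()
df_next.select("allcol.C").show(3)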

Grouping by A and collecting the rows into a nested list:

df_next_list = df_next.groupBy("A").agg(F.collect_list("allcol").alias("collected_cols"))

+--------------+--------------------+
|             A|      collected_cols|
+--------------+--------------------+
|50009100090000|[[-999.0,210.0,-9...|
|50009100070000|[[-999.0,1110.0,-...|
|50009100170000|[[10.14438,303.0,...|
|50283200140000|[[9.8958,36.0,-99...|
|50009100040000|[[-999.0,290.5,-9...|
+--------------+--------------------+

This is what calling df_next_list.rdd.take(n) looks like:

Row(A=50009100090000, collected_cols=[Row(B=-999.0, C=210.0, D=-999.0, E=-999.0, F=16.016660690307617, G=-999.0, H=7.6022491455078125, I=-999.0, J=30.627119064331055), Row(B=-999.0, C=210.5, D=-999.0, E=-999.0, F=18.973539352416992, G=-999.0, H=15.784810066223145, I=-999.0, J=29.249160766601562......)])]
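On the driver, the nested fields of these Rows can be reached by plain indexing, for example:

rows = df_next_list.rdd.take(1)
print(rows[0].A)                    # 50009100090000
print(rows[0].collected_cols[0].B)  # -999.0
print(rows[0].collected_cols[0].C)  # 210.0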

From here I am stuck on extracting the correct elements for each variable (B, C, D, E, F, G, H, I, J) in the collected_cols list, grouped by column A, and on converting df_next_list into the Spark ML form of label (J) and features (B, C, D, E, F, G, H, I). With that in place I could build Spark ML algorithms. This is what I would like the RDD to look like, so the elements can be extracted easily:

Row(A=50009100090000, B=-999.0, C=210.0, D=-999.0, E=-999.0, F=264.7466125488281, G=-999.0, H=-999.0, I=-999.0, J=-999.0),....
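One direction that might get there (a rough, untested sketch; df_flat, assembler and ml_df are names introduced here only for illustration) is to explode collected_cols back into one row per struct, promote the struct fields to top-level columns, and then build the features vector with VectorAssembler:

from pyspark.ml.feature import VectorAssembler

# Explode the collected structs back into one row per element,
# then promote the struct fields (B..J) to top-level columns.
df_flat = (df_next_list
           .withColumn("allcol", F.explode("collected_cols"))
           .select("A", "allcol.*"))

# Assemble B..I into a single features vector and use J as the label.
assembler = VectorAssembler(
    inputCols=["B", "C", "D", "E", "F", "G", "H", "I"],
    outputCol="features")
ml_df = assembler.transform(df_flat).select(F.col("J").alias("label"), "features")
ml_df.show(5)

If the per-group structure is not actually needed, the same VectorAssembler could also be applied to the original df directly, skipping the struct and collect_list steps.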

Any help would be greatly appreciated. Thanks.

0 Answers:

No answers yet