How to convert ArrayType to DenseVector in a PySpark DataFrame?

Time: 2016-08-18 19:02:22

Tags: python apache-spark pyspark apache-spark-mllib apache-spark-ml

I am getting the following error when trying to build an ML Pipeline:

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'

My features column contains an array of floating point values. It sounds like I need to convert those to some kind of vector (it is not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame, or do I need to convert to an RDD?

1 answer:

Answer 0: (score: 19)

You can use a UDF:

udf(lambda vs: Vectors.dense(vs), VectorUDT())

In Spark < 2.0, import:

from pyspark.mllib.linalg import Vectors, VectorUDT

In Spark 2.0+, import:

from pyspark.ml.linalg import Vectors, VectorUDT

Note that these classes are not compatible with each other, even though their implementations are identical.

You can also extract the individual features and assemble them with VectorAssembler. Assuming the input column is called features:

from pyspark.ml.feature import VectorAssembler

n = ... # Size of features

assembler = VectorAssembler(
    inputCols=["features[{0}]".format(i) for i in range(n)], 
    outputCol="features_vector")

assembler.transform(df.select(
    "*", *(df["features"].getItem(i) for i in range(n))
))