I get the following error when trying to build an ML Pipeline:
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'
My features column contains an array of floating-point values. It sounds like I need to convert them to some type of vector (it is not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame, or do I need to convert to an RDD?
Answer 0 (score: 19)
You can use a UDF:
from pyspark.sql.functions import udf
udf(lambda vs: Vectors.dense(vs), VectorUDT())
In Spark < 2.0, import:
from pyspark.mllib.linalg import Vectors, VectorUDT
In Spark 2.0+, import:
from pyspark.ml.linalg import Vectors, VectorUDT
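Note that these classes are not compatible with each other, even though their implementations are identical. The error in the question asks for org.apache.spark.ml.linalg.VectorUDT, which corresponds to the pyspark.ml.linalg import.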
You can also extract the individual features and assemble them with VectorAssembler. Assuming the input column is called features:
from pyspark.ml.feature import VectorAssembler

n = ...  # Size of features

assembler = VectorAssembler(
    inputCols=["features[{0}]".format(i) for i in range(n)],
    outputCol="features_vector")

assembler.transform(df.select(
    "*", *(df["features"].getItem(i) for i in range(n))
))
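The snippet above leaves n unspecified. One possible way to package the same idea is sketched below, under the assumption that the DataFrame is non-empty and every row holds an array of the same length, so that the first row is enough to determine n. The helper names `element_names` and `assemble_array` are hypothetical and not part of Spark or of the original answer.

```python
def element_names(column, n):
    """Column names in the "column[i]" form that the answer above relies on."""
    return ["{0}[{1}]".format(column, i) for i in range(n)]


def assemble_array(df, column="features", output_col="features_vector"):
    """Expand an array column into scalar columns and assemble them into a vector."""
    # Imported here so that element_names stays usable without Spark installed.
    from pyspark.ml.feature import VectorAssembler

    # Assumption: df is non-empty and all arrays share the length of the first one.
    n = len(df.select(column).first()[0])

    # One scalar column per array element, alongside all original columns.
    expanded = df.select("*", *(df[column].getItem(i) for i in range(n)))

    assembler = VectorAssembler(
        inputCols=element_names(column, n),
        outputCol=output_col)
    return assembler.transform(expanded)
```

Calling `assemble_array(df)` would return the DataFrame with the extra per-element columns plus a `features_vector` column. Compared with the UDF, this keeps the conversion inside Spark's built-in transformers, at the cost of needing to know the array length up front.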