How to convert type <class 'pyspark.sql.types.Row'> into Vector

Time: 2017-03-02 04:11:07

Tags: python apache-spark machine-learning pyspark k-means

I am completely new to Spark, and I am currently trying to write a simple piece of Python code that runs KMeans on a set of data.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import re
from pyspark.mllib.clustering import KMeans, KMeansModel
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import SparseVector
from numpy import array
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler

import pandas as pd
import numpy
df = pd.read_csv("/<path>/Wholesale_customers_data.csv")
sql_sc = SQLContext(sc)
cols = ["Channel", "Region", "Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
s_df = sql_sc.createDataFrame(df)
vectorAss = VectorAssembler(inputCols=cols, outputCol="feature")
vdf = vectorAss.transform(s_df)
km = KMeans.train(vdf, k=2, maxIterations=10, runs=10, initializationMode="k-means||")
model = kmeans.fit(vdf)
cluster = model.clusterCenters()
print(cluster)

I typed these lines into the pyspark shell, and when it runs model = kmeans.fit(vdf) I get the following error:

TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/02/26 23:31:58 ERROR Executor: Exception in task 6.0 in stage 23.0 (TID 113)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py", line 77, in _convert_to_vector
    raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
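For context, the stack trace ends in _convert_to_vector from the legacy pyspark.mllib package: KMeans.train expects an RDD of numeric vectors, not a DataFrame of Row objects. A minimal illustrative sketch of that conversion, assuming the s_df and cols defined in the code above, might look like:

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

# Illustrative only: build an RDD of mllib DenseVectors from the DataFrame rows,
# which is the input shape KMeans.train actually accepts.
feature_rdd = s_df.rdd.map(lambda row: Vectors.dense([float(row[c]) for c in cols]))
mllib_model = KMeans.train(feature_rdd, k=2, maxIterations=10, initializationMode="k-means||")
print(mllib_model.clusterCenters)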

The data I am using comes from: https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv

Can someone tell me what is going wrong here and what I am missing? I appreciate any help.

Thanks!

UPDATE: @Garren The error I am getting is:

>>> kmm = kmeans.fit(s_df)
17/03/02 21:58:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:56193 in memory (size: 5.8 KB, free: 511.1 MB)
17/03/02 21:58:01 INFO ContextCleaner: Cleaned accumulator 5
17/03/02 21:58:01 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:56193 in memory (size: 5.8 KB, free: 511.1 MB)
17/03/02 21:58:01 INFO ContextCleaner: Cleaned accumulator 4

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/ml/pipeline.py", line 69, in fit
    return self._fit(dataset)
  File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/ml/wrapper.py", line 133, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/ml/wrapper.py", line 130, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'features' given input columns: [Channel, Grocery, Fresh, Frozen, Detergents_Paper, Region, Delicassen, Milk];"
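For context on this update, the ML KMeans estimator reads its input from the column named by its featuresCol parameter (default "features"); here the fit was run on s_df, which only has the raw CSV columns, so that column cannot be resolved. A minimal sketch, assuming the cols list from the code above and an assembler whose outputCol is "feature":

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Illustrative only: fit on the assembled DataFrame and point featuresCol
# at the assembler's output column instead of relying on the default "features".
assembler = VectorAssembler(inputCols=cols, outputCol="feature")
assembled = assembler.transform(s_df)
kmeans = KMeans(k=2, maxIter=10, featuresCol="feature")
kmm = kmeans.fit(assembled)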

1 Answer:

Answer 0 (score: 2):

Use the Spark 2.x ML package exclusively, over the [soon to be deprecated] Spark mllib package:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Read the CSV straight into a Spark DataFrame, letting Spark infer column types.
df = spark.read.option("inferSchema", "true").option("header", "true").csv("whole_customers_data.csv")
cols = df.columns
# Assemble every column into a single vector column named "features",
# which is the default featuresCol the ML KMeans estimator looks for.
vectorAss = VectorAssembler(inputCols=cols, outputCol="features")
vdf = vectorAss.transform(df)
# Fit the ML KMeans estimator directly on the assembled DataFrame.
kmeans = KMeans(k=2, maxIter=10, seed=1)
kmm = kmeans.fit(vdf)
kmm.clusterCenters()
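As a usage note beyond the original answer, the fitted model can also assign a cluster label to each row via transform; a small sketch, assuming the vdf assembled above:

# The cluster index for each row is written to the default "prediction" column.
labeled = kmm.transform(vdf)
labeled.select("features", "prediction").show(5)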