I have a PySpark DataFrame containing rows of comma-separated data. I want to split each row, apply LabeledPoints to each split row, and then convert the result into a DataFrame. Here is my code; calling .toDF() produces the error message shown below it.
import os.path
from pyspark.mllib.regression import LabeledPoint
import numpy as np
file_name = os.path.join('databricks-datasets', 'cs190', 'data-001', 'millionsong.txt')
raw_data_df = sqlContext.read.load(file_name, 'text')
rdd = raw_data_df.rdd.map(lambda line: line.split(',')).map(lambda seq:LabeledPoints(seq[0],seq[1:])).toDF()
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-65-dc4d86a8ee45> in <module>()
----> 1 rdd = raw_data_df.rdd.map(lambda line: line.split(',')).map(lambda seq:LabeledPoints(seq[0],seq[1:])).toDF()
      2 print(type(rdd))
      3 #print(rdd.take(5))

/databricks/spark/python/pyspark/sql/context.py in toDF(self, schema, sampleRatio)
     62         [Row(name=u'Alice', age=1)]
     63         """
---> 64         return sqlContext.createDataFrame(self, schema, sampleRatio)
     65
     66 RDD.toDF = toDF

/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
    421
    422         if isinstance(data, RDD):
--> 423             rdd, schema = self._createFromRDD(data, schema, samplingRatio)
    424         else:
    425             rdd, schema = self._createFromLocal(data, schema)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 44, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
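For context on the failure: reading with the 'text' source produces a DataFrame of Row objects with a single 'value' column, so raw_data_df.rdd yields Rows, not plain strings, and line.split(',') raises inside the executor. A minimal local sketch (plain Python, with a dict standing in for a pyspark Row and made-up sample lines) illustrates why the text has to be pulled out of the row first:

```python
# Local sketch: a pyspark Row behaves like a record, not a string.
# A dict stands in here for Row(value='...') from the 'text' reader;
# the sample lines are made up for illustration.
rows = [{'value': '2001.0,0.88,0.61'}, {'value': '2007.0,0.77,0.45'}]

# Mapping split() directly over the rows fails: a record has no split().
error_seen = False
try:
    [row.split(',') for row in rows]
except AttributeError:
    error_seen = True

# Extracting the 'value' field first works.
split_rows = [row['value'].split(',') for row in rows]
print(error_seen)        # True
print(split_rows[0][0])  # '2001.0'
```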
Answer (score: 0):

rdd = raw_data_df.map(lambda row: row['value'].split(',')).map(lambda seq: LabeledPoint(float(seq[0]), seq[1:])).toDF()

Here each line of text has to be referenced explicitly as row['value'], even though the row contains only that single column.
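The parsing step the corrected pipeline performs can be sketched in plain Python, without Spark. This mirrors LabeledPoint(float(seq[0]), seq[1:]) under the assumption that the label comes first on each line; unlike the answer above, the features are also converted to float here for clarity (the sample line is made up):

```python
def parse_line(value):
    """Split one comma-separated line into (label, features),
    mirroring LabeledPoint(float(seq[0]), seq[1:])."""
    seq = value.split(',')
    # First field is the label; the rest are the feature values.
    return float(seq[0]), [float(x) for x in seq[1:]]

label, features = parse_line('2001.0,0.88,0.61')
print(label)     # 2001.0
print(features)  # [0.88, 0.61]
```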