Question

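I am writing a function that:

takes an RDD as input,
splits the comma-separated values,
then converts each row into a LabeledPoint object,

and finally returns the output as a DataFrame.

Code: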

def parse_points(raw_rdd):

    cleaned_rdd = raw_rdd.map(lambda line: line.split(","))
    new_df = cleaned_rdd.map(lambda line:LabeledPoint(line[0],[line[1:]])).toDF()
    return new_df


output = parse_points(input_rdd)

If I run this code, it works with no errors.

But when I add the line

 output.take(5)

I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 129.0 failed 1 times, most recent failure: Lost task 0.0 in stage 129.0 (TID 152, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Py4JJavaError       Traceback (most recent call last)
<ipython-input-100-a68c448b64b0> in <module>()
 20 
 21 output = parse_points(raw_rdd)
 ---> 22 print output.show()

Please tell me what this error is.

Answer 1

The reason there is no error until you run the action

 output.take(5)

is that Spark is lazy: nothing is actually executed until the take(5) action runs.
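The lazy behaviour described above can be illustrated without Spark. The sketch below is only a plain-Python analogy: it uses the built-in map(), which is also lazy in Python 3, and the sample rows and the to_floats helper are made up for illustration. The bad value causes no error when the map is defined, only when the result is forced.

```python
# Plain-Python analogy (not Spark): the built-in map() is lazy as well.
rows = ["1.0,2.0,3.0", "oops,4.0,5.0"]  # hypothetical input; second row is bad

def to_floats(line):
    # Hypothetical helper: split a CSV line and convert every field to float.
    return [float(x) for x in line.split(",")]

lazy = map(to_floats, rows)  # no error here, nothing has been computed yet

try:
    result = list(lazy)      # forcing evaluation, comparable to a Spark action
    error = None
except ValueError as exc:
    result = None
    error = exc              # the failure only surfaces at this point

print(type(error).__name__)
```

In Spark the same pattern applies: map() and toDF() only describe the work, and an action such as take(5) is what actually runs it.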

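There are a few problems in your code. I think the failure is caused by the extra "[" and "]" in [line[1:]].

So you need to remove the extra "[" and "]" from [line[1:]] and keep only line[1:].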
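To see what the extra brackets do, here is a small plain-Python sketch (the sample line is made up). It only shows the shape of the two expressions and does not call LabeledPoint itself.

```python
# Hypothetical sample row, split the same way as in the question.
line = "1.0,2.5,3.5".split(",")

nested = [line[1:]]  # a list containing one list: [['2.5', '3.5']]
flat = line[1:]      # a flat list of fields:      ['2.5', '3.5']

print(nested)
print(flat)
```

The nested form wraps the feature list in a second list, so the features are no longer a flat sequence of values. Note also that split() yields strings, not numbers.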

Another issue you may need to address is the missing DataFrame schema, i.e. replace toDF() with toDF(["features", "label"]). This gives the DataFrame a schema.

Answer 2

Try:

>>> raw_rdd.map(lambda line: line.split(",")) \
...     .map(lambda line: LabeledPoint(line[0], [float(x) for x in line[1:]]))
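The parsing step in the snippet above can be tried on its own, without a Spark session. This is a minimal sketch: parse_line is a hypothetical helper name, the sample line is made up, and converting the label to float as well is an assumption that goes slightly beyond the answer.

```python
def parse_line(line):
    """Split one CSV line into (label, features), both numeric."""
    fields = [field.strip() for field in line.split(",")]
    label = float(fields[0])                    # assumption: label is numeric
    features = [float(x) for x in fields[1:]]   # flat list of floats
    return label, features

label, features = parse_line("1.0, 2.5, 3.5")
print(label, features)

# In PySpark the two values would then be passed to
# LabeledPoint(label, features) inside rdd.map(...); that part is not run here.
```

Keeping the parsing in a named function makes it easy to check on a single line before running it across the whole RDD.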

Error in LabeledPoint object in PySpark
