I always get this error: AnalysisException: u"cannot resolve 'substring(l,1,-1)' due to data type mismatch: argument 1 requires (string or binary) type, however, 'l' is of array type.;"
Quite confused because l[0] is a string, and matches arg 1. dataframe has only one column named 'value', which is a comma separated string. And I want to convert this original dataframe to another dataframe of object LabeledPoint, with the first element to be 'label' and the others to be 'features'.
from pyspark.mllib.regression import LabeledPoint
def parse_points(dataframe):
df1=df.select(split(dataframe.value,',').alias('l'))
u_label_point=udf(LabeledPoint)
df2=df1.select(u_label_point(col('l')[0],col('l')[1:-1]))
return df2
parsed_points_df = parse_points(raw_data_df)