I am trying to convert a Spark RDD in the following format into a pandas DataFrame:
['f1\tf2\tf3\tf4\tf5','4.0\tNULL\t183.0\t190.0\tMARRIED']
When I run the code below, the third line raises the error: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'":
sparkDF = data.map(lambda x: str(x))
sparkDF2 = sparkDF.map(lambda w: w.split('\t'))
sparkDF3 = sparkDF2.toDF()
Any suggestions would be greatly appreciated!
Answer 0 (score: 0)
In pandas
x = ['f1\tf2\tf3\tf4\tf5','4.0\tNULL\t183.0\t190.0\tMARRIED']
rdd = sc.parallelize(x)
# either collect first, then split each line:
# rows = [l.split('\t') for l in rdd.collect()]
# or split inside the RDD, then collect:
rows = rdd.map(lambda l: l.split('\t')).collect()
import pandas as pd
pd.DataFrame(rows)
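Note that the snippet above keeps the header line (`f1`…`f5`) as an ordinary data row. A small pure-Python sketch (no Spark is needed for this step; `x` is the sample list from the question) that promotes the first line to column names before building the DataFrame:

```python
import pandas as pd

# Sample data from the question: one tab-delimited header line
# followed by one tab-delimited record.
x = ['f1\tf2\tf3\tf4\tf5', '4.0\tNULL\t183.0\t190.0\tMARRIED']

rows = [line.split('\t') for line in x]        # split each line on tabs
df = pd.DataFrame(rows[1:], columns=rows[0])   # first row becomes the header
print(df)
```

All values stay as strings here; you could follow up with `df.apply(pd.to_numeric, errors='coerce')` on the numeric columns if needed.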
In Spark
x=['f1\tf2\tf3\tf4\tf5','4.0\tNULL\t183.0\t190.0\tMARRIED']
rdd=sc.parallelize(x)
df=rdd.map(lambda x:x.split("\t")).toDF()
df.show()
+---+----+-----+-----+-------+
| _1| _2| _3| _4| _5|
+---+----+-----+-----+-------+
| f1| f2| f3| f4| f5|
|4.0|NULL|183.0|190.0|MARRIED|
+---+----+-----+-----+-------+