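I have two PySpark DataFrames that I want to join into a new DataFrame. The join operation appears to produce an empty DataFrame.
I'm evaluating the code in a Jupyter notebook on a PySpark kernel, on a cluster with a single master, 4 workers, and YARN for resource allocation.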
from pyspark.sql.functions import monotonically_increasing_id,udf
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import DenseVector
# UDF that returns the element at index 1 of the probability vector as a float
firstelement = udf(lambda v: float(v[1]), FloatType())
a = [{'c_id': 'a', 'cv_id': 'b', 'id': 1}, {'c_id': 'c', 'cv_id': 'd', 'id': 2}]
ip = spark.createDataFrame(a)
b = [{'probability': DenseVector([0.99,0.01]), 'id': 1}, {'probability': DenseVector([0.6,0.4]), 'id': 2}]
op = spark.createDataFrame(b)
op.show() #shows the df
#probability, id
#[0.99, 0.01], 1
##probability is a dense vector, id is bigint
ip.show() #shows the df
#c_id, cv_id, id
#a,b,1
##c_id and cv_id are strings, id is bigint
op_final = op.join(ip, ip.id == op.id).select('c_id','cv_id',firstelement('probability')).withColumnRenamed('<lambda>(probability)','probability')
op_final.show() # shows an empty DataFrame
# the calls below seem to make it work, but they are quite slow
ip.collect()
op.collect()
op_final.collect()
op_final.show() #shows the joined df
Maybe this is just my lack of Spark expertise, but can someone explain why I can see the first two DataFrames, yet cannot see the joined DataFrame unless I use collect()?
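To make the expected result concrete, here is a minimal plain-Python sketch (no Spark involved) of what the inner join on `id` should return for the sample data above. The helper name `expected_join` is hypothetical and only for illustration, the vectors are written as plain lists instead of `DenseVector`, and the sketch assumes each `id` appears at most once in `ip`.

```python
# Same sample rows as above, with the probability vectors as plain lists.
ip_rows = [{'c_id': 'a', 'cv_id': 'b', 'id': 1}, {'c_id': 'c', 'cv_id': 'd', 'id': 2}]
op_rows = [{'probability': [0.99, 0.01], 'id': 1}, {'probability': [0.6, 0.4], 'id': 2}]

def expected_join(ip_rows, op_rows):
    """Hypothetical helper: inner join on 'id', keeping c_id, cv_id and probability[1]."""
    # Assumption: ids are unique in ip_rows, so a dict lookup is enough.
    ip_by_id = {row['id']: row for row in ip_rows}
    result = []
    for row in op_rows:
        match = ip_by_id.get(row['id'])
        if match is not None:  # inner join: skip rows without a matching id
            # Mirrors the firstelement UDF, which takes index 1 of the vector.
            result.append((match['c_id'], match['cv_id'], float(row['probability'][1])))
    return result

expected = expected_join(ip_rows, op_rows)
print(expected)
```

This is the content I would expect `op_final.show()` to display without having to call `collect()` first.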