Joining PySpark DataFrames on a nested field

Date: 2016-04-12 14:24:45

Tags: apache-spark dataframe join pyspark apache-spark-sql

I would like to perform a join between these two PySpark DataFrames:

from pyspark import SparkContext
from pyspark.sql import Row, SQLContext
from pyspark.sql.functions import col

sc = SparkContext()
sqlContext = SQLContext(sc)  # needed so that rdd.toDF() works outside the shell

df1 = sc.parallelize([
    ['owner1', 'obj1', 0.5],
    ['owner1', 'obj1', 0.2],
    ['owner2', 'obj2', 0.1]
]).toDF(('owner', 'object', 'score'))

df2 = sc.parallelize([
    Row(owner=u'owner1',
        objects=[Row(name=u'obj1', value=Row(fav=True, ratio=0.3))])
]).toDF()
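
For context, df2's schema should look roughly like this (Row sorts its keyword arguments alphabetically, so objects comes before owner). Since objects is an array of structs, objects.name resolves to an array of strings, which is what causes the error further down:

df2.printSchema()
## root
##  |-- objects: array (nullable = true)
##  |    |-- element: struct (containsNull = true)
##  |    |    |-- name: string (nullable = true)
##  |    |    |-- value: struct (nullable = true)
##  |    |    |    |-- fav: boolean (nullable = true)
##  |    |    |    |-- ratio: double (nullable = true)
##  |-- owner: string (nullable = true)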

The join has to be performed on the name of the object, i.e. the field name inside objects in df2 and the field object in df1.

I can run a SELECT on the nested field, such as

df2.where(df2.owner == 'owner1').select(col("objects.value.ratio")).show()

but I cannot run this join:

df2.alias('u').join(df1.alias('s'), col('u.objects.name') == col('s.object'))

which returns the error

pyspark.sql.utils.AnalysisException: u"cannot resolve '(objects.name = cast(object as double))' due to data type mismatch: differing types in '(objects.name = cast(object as double))' (array<string> and double).;"

Any idea how to solve this?

1 Answer:

Answer 0 (score: 7)

Since you want to match and extract a specific element, the simplest approach is to explode the row:

matches = df2.withColumn("object", explode(col("objects"))).alias("u").join(
    df1.alias("s"),
    col("s.object") == col("u.object.name"))

matches.show()
## +-------------------+------+-----------------+------+------+-----+
## |            objects| owner|           object| owner|object|score|
## +-------------------+------+-----------------+------+------+-----+
## |[[obj1,[true,0.3]]]|owner1|[obj1,[true,0.3]]|owner1|  obj1|  0.5|
## |[[obj1,[true,0.3]]]|owner1|[obj1,[true,0.3]]|owner1|  obj1|  0.2|
## +-------------------+------+-----------------+------+------+-----+
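
For completeness, explode, col, and expr (used in the array_contains variant below) all come from pyspark.sql.functions, and one possible way to keep only the fields of interest after the join is a select like the following (the aliases are just illustrative, not part of the original answer):

from pyspark.sql.functions import col, explode, expr

matches.select(
    col("u.owner").alias("owner"),
    col("u.object.name").alias("name"),
    col("u.object.value.ratio").alias("ratio"),
    col("s.score").alias("score")
).show()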

An alternative, but very inefficient, approach is to use array_contains:

matches_contains = df1.alias("s").join(
    df2.alias("u"), expr("array_contains(objects.name, object)"))

It is inefficient because it will be expanded to a Cartesian product:

matches_contains.explain()
## == Physical Plan ==
## Filter array_contains(objects#6.name,object#4)
## +- CartesianProduct
##    :- Scan ExistingRDD[owner#3,object#4,score#5]
##    +- Scan ExistingRDD[objects#6,owner#7]

If the size of the array is relatively small, it is possible to generate an optimized version of array_contains, as I've shown here: Filter by whether column value equals a list in spark
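
The linked answer deals with literal lists, so as a rough sketch of the same idea applied to this join (my adaptation, not the linked code): assuming a known upper bound max_size on the array length, the array can be expanded position by position and the pieces unioned, which turns the condition into a plain equi-join instead of a Cartesian product:

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

max_size = 3  # assumed upper bound on the length of the objects array

# one row per (owner, array position), dropping positions past the array's end
expanded = reduce(DataFrame.unionAll, [
    df2.select(col("owner"), col("objects").getItem(i).alias("object"))
       .where(col("object").isNotNull())
    for i in range(max_size)
])

optimized = expanded.alias("u").join(
    df1.alias("s"), col("u.object.name") == col("s.object"))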