python + pyspark: inner join error in pyspark when comparing on multiple columns

Asked: 2016-09-22 06:35:08

Tags: python apache-spark pyspark pyspark-sql

Hi, I have 2 dataframes to join:

#df1
 name    genre  count
 satya   drama    1
 satya   action   3
 abc     drama    2
 abc     comedy   2
 def     romance  1

#df2
 name  max_count
 satya  3
 abc    2
 def    1

Now I want to join the 2 dfs on name and count == max_count, but I am getting an error:

import pyspark.sql.functions as F
from pyspark.sql.functions import count, col
from pyspark.sql.functions import struct
df = spark.read.csv('file',sep = '###', header=True)
df1 = df.groupBy("name", "genre").count()
df2 = df1.groupby('name').agg(F.max("count").alias("max_count"))
#Now trying to join both dataframes
final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count))
final_df.show() ###Error
#py4j.protocol.Py4JJavaError: An error occurred while calling o207.showString.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
#Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: count(1)
    at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)

But a "left" join succeeds:

final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count), "left")
final_df.show()  ### Succeeds, but I don't want a left join, I want an inner join

My question is: why does the join above fail, and am I doing something wrong there?

I referred to the link "Find maximum row per group in Spark DataFrame" and used its first answer (the two-groupBy approach), but I get the same error.
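
For reference, the window-function approach from that link would look roughly like the sketch below; I have not verified it against the 2.0.0 error, and the Window import plus the w and max_count names are introduced here. It computes max(count) per name in place and filters, so no second DataFrame and no join are needed.

from pyspark.sql import Window
import pyspark.sql.functions as F

# Attach the per-name maximum as a column via a window, then keep
# only the rows whose count equals that maximum.
w = Window.partitionBy('name')
result = (df1.withColumn('max_count', F.max('count').over(w))
             .where(F.col('count') == F.col('max_count')))
result.show()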

I am on spark-2.0.0-bin-hadoop2.7 with Python 2.7.

Please advise. Thanks.

EDIT:

The scenario above works on Spark 1.6, which is very surprising. So what is wrong with Spark 2.0, or with my installation? I will reinstall, check, and update here.

Has anyone tried this on Spark 2.0 and succeeded, following Yaron's answer below?

3 Answers:

Answer 0 (score: 2):

Update: It seems your code was also failing because it uses "count" as a column name; count appears to be a protected keyword in the DataFrame API. Renaming count to "mycount" solved the problem. The working code below was adjusted to run on Spark 1.5.2, the version I used to test your problem.

import pyspark.sql.functions as F
from pyspark.sql.functions import col

df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/tmp/fac_cal.csv")
df1 = df.groupBy("name", "genre").count()
df1 = df1.select(col("name"),col("genre"),col("count").alias("mycount"))
df2 = df1.groupby('name').agg(F.max("mycount").alias("max_count"))
df2 = df2.select(col('name').alias('name2'),col("max_count"))
#Now trying to join both dataframes
final_df = df1.join(df2,[df1.name == df2.name2 , df1.mycount == df2.max_count])
final_df.show()

+-----+---------+-------+-----+---------+
| name|    genre|mycount|name2|max_count|
+-----+---------+-------+-----+---------+
|brata|   comedy|      2|brata|        2|
|brata|    drama|      2|brata|        2|
|panda|adventure|      1|panda|        1|
|panda|  romance|      1|panda|        1|
|satya|   action|      3|satya|        3|
+-----+---------+-------+-----+---------+

An example of complex join conditions, from https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html:
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]

You can try:

final_df = df1.join(df2, [df1.name == df2.name , df1.mycount == df2.max_count])

Also note that, according to the spec, "left" is not a valid join type: how – str, default 'inner'; one of inner, outer, left_outer, right_outer, leftsemi.
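
For instance, a minimal sketch that passes one of the documented type strings explicitly (reusing the aliased df1/df2 from the code above):

final_df = df1.join(df2, [df1.name == df2.name2, df1.mycount == df2.max_count], 'left_outer')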

Answer 1 (score: 2):

I hit the same problem when I tried to join two DataFrames, one of which came from a groupBy (GroupedData). Caching the grouped DataFrame before the inner join worked for me. For your code, try:

df1 = df.groupBy("name", "genre").count().cache()    # added cache()
df2 = df1.groupby('name').agg(F.max("count").alias("max_count")).cache()   # added cache()
final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count))    # no change
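
A guess at why this helps (mine, not the answerer's): cache() lets the join resolve its condition against the materialized columns of the cached relation, rather than referencing the aggregate expression count(1) from the unevaluated plan, which is exactly what the error message complains about.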

Answer 2 (score: 0):

My workaround in Spark 2.0:

I created a single column ('combined') from the columns used in the join comparison ('name', 'mycount') in each of the dfs. Now I have one column to compare, and this doesn't cause any problem, since I am comparing only one column.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def combine_func(*args):
    data = '_'.join([str(x) for x in args])  ### convert non-strings to str, then concatenate
    return data
combine_new = udf(combine_func, StringType())  ## register the function as a udf
df1 = df1.withColumn('combined_new_1', combine_new(df1['name'], df1['mycount']))  ### a col holding the value concatenated from the name and mycount columns, e.g. 'satya_3'
df2 = df2.withColumn('combined_new_2', combine_new(df2['name2'], df2['max_count']))
#df1.columns == 'name', 'genre', 'mycount', 'combined_new_1'
#df2.columns == 'name2', 'max_count', 'combined_new_2'
#Now join
final_df = df1.join(df2, df1.combined_new_1 == df2.combined_new_2, 'inner')
#final_df = df1.join(df2, df1.combined_new_1 == df2.combined_new_2, 'inner').select('the columns you want')
final_df.show()  #### It shows the result, trust me.

Unless you are in a hurry, don't follow this; better to look for a reliable solution.
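
If you want a sturdier version of this workaround, one option (my sketch, not the answerer's code) is to build the same combined keys with the built-in concat_ws, available since Spark 1.5, instead of a Python UDF; the explicit string casts are an assumption added to keep the input types uniform:

from pyspark.sql.functions import concat_ws

# Same combined-key idea as above, but with a built-in function.
df1 = df1.withColumn('combined_new_1', concat_ws('_', df1['name'], df1['mycount'].cast('string')))
df2 = df2.withColumn('combined_new_2', concat_ws('_', df2['name2'], df2['max_count'].cast('string')))
final_df = df1.join(df2, df1.combined_new_1 == df2.combined_new_2, 'inner')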