Hi, I have 2 dataframes to join:
#df1
name genre count
satya drama 1
satya action 3
abc drame 2
abc comedy 2
def romance 1
#df2
name max_count
satya 3
abc 2
def 1
Now I want to join the two dfs on name and count == max_count, but I am getting an error:
import pyspark.sql.functions as F
from pyspark.sql.functions import count, col
from pyspark.sql.functions import struct
df = spark.read.csv('file',sep = '###', header=True)
df1 = df.groupBy("name", "genre").count()
df2 = df1.groupby('name').agg(F.max("count").alias("max_count"))
#Now trying to join both dataframes
final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count))
final_df.show() ###Error
#py4j.protocol.Py4JJavaError: An error occurred while calling o207.showString.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
#Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: count(1)
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
But a "left" join succeeds:
final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count), "left")
final_df.show() ### Succeeds, but I don't want a left join, I want an inner join
My question is: why does the join above fail? Am I doing something wrong there?
I referred to the link "Find maximum row per group in Spark DataFrame" and used the first answer (the two-groupBy approach), but I get the same error.
I am on spark-2.0.0-bin-hadoop2.7 and Python 2.7.
Please advise. Thanks.
The above scenario works on Spark 1.6, which is quite surprising, so either something is off in Spark 2.0 or in my installation; I will reinstall, check, and update here.
Has anyone tried this on Spark 2.0 and succeeded by following Yaron's answer below?
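As an aside, another common way to get the max row per group, which sidesteps the self-join entirely, is a window function. A minimal sketch (my addition, not from the original post), assuming the df1 built above:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# attach the per-name maximum to every row, then keep only the rows that reach it
w = Window.partitionBy("name")
max_rows = (df1
            .withColumn("max_count", F.max("count").over(w))
            .where(F.col("count") == F.col("max_count")))
max_rows.show()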
Answer 0 (score: 2)
Update: it seems your code also fails because "count" is used as a column name; count appears to be a protected keyword in the DataFrame API. Renaming count to "mycount" solved the problem. The working code below was modified to run on Spark 1.5.2, the version I used to test your issue.
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/tmp/fac_cal.csv")
df1 = df.groupBy("name", "genre").count()
# rename the aggregated column so it no longer clashes with the protected "count" name
df1 = df1.select(col("name"),col("genre"),col("count").alias("mycount"))
df2 = df1.groupby('name').agg(F.max("mycount").alias("max_count"))
# alias "name" to "name2" so the joined result has unambiguous column names
df2 = df2.select(col('name').alias('name2'),col("max_count"))
#Now trying to join both dataframes
final_df = df1.join(df2,[df1.name == df2.name2 , df1.mycount == df2.max_count])
final_df.show()
+-----+---------+-------+-----+---------+
| name| genre|mycount|name2|max_count|
+-----+---------+-------+-----+---------+
|brata| comedy| 2|brata| 2|
|brata| drama| 2|brata| 2|
|panda|adventure| 1|panda| 1|
|panda| romance| 1|panda| 1|
|satya| action| 3|satya| 3|
+-----+---------+-------+-----+---------+
An example of a complex join condition from https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html:
cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
You can try:
final_df = df1.join(df2, [df1.name == df2.name , df1.mycount == df2.max_count])
Also note that according to the spec, "left" is not a valid join type: how - str, default 'inner'; one of inner, outer, left_outer, right_outer, leftsemi.
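If an outer-style join is actually wanted, a minimal sketch (my illustration, reusing the renamed df1/df2 from the code above) with one of the join types listed in the spec:
# 'left_outer' is one of the valid join types quoted above
final_df = df1.join(df2, [df1.name == df2.name2, df1.mycount == df2.max_count], 'left_outer')
final_df.show()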
Answer 1 (score: 2)
I ran into the same problem when I tried to join two DataFrames where one of them was GroupedData. It worked for me when I cached the GroupedData DataFrame before the inner join. For your code, try:
df1 = df.groupBy("name", "genre").count().cache() # added cache()
df2 = df1.groupby('name').agg(F.max("count").alias("max_count")).cache() # added cache()
final_df = df1.join(df2, (df1.name == df2.name) & (df1.count == df2.max_count)) # no change
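A quick usage sketch (my addition) to verify the cached variant, assuming the DataFrames above:
final_df.show()     # should now display the joined rows without the Py4JJavaError
final_df.explain()  # optional: inspect the physical plan of the inner join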
Answer 2 (score: 0)
I created a single column ('combined') from the columns used in the join comparison ('name', 'mycount') in the respective dfs, so now I have only one column to compare. This does not cause any problem, since I am comparing a single column.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def combine_func(*args):
    data = '_'.join([str(x) for x in args])  ### convert non-string values to str, then concatenate
    return data
combine_func = udf(combine_func, StringType())  ## register the function as a udf
df1 = df1.withColumn('combined_new_1', combine_func(df1['name'], df1['mycount']))  ### a column holding the concatenated value of name and mycount, e.g. 'satya_3'
df2 = df2.withColumn('combined_new_2', combine_func(df2['name2'], df2['max_count']))
#df1.columns == 'name', 'genre', 'mycount', 'combined_new_1'
#df2.columns == 'name2', 'max_count', 'combined_new_2'
#Now join
final_df = df1.join(df2, df1.combined_new_1 == df2.combined_new_2, 'inner')
#final_df = df1.join(df2, df1.combined_new_1 == df2.combined_new_2, 'inner').select('the columns you want')
final_df.show()  #### It is showing the result, trust me.
Unless you are in a hurry, do not follow this; better to look for a reliable solution.
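For what it is worth, a built-in alternative that avoids the Python UDF is concat_ws; this is my suggestion, not part of the original answer. A minimal sketch, assuming the df1/df2 with the renamed columns from Answer 0:
import pyspark.sql.functions as F
# concat_ws casts each column to string and joins them with the separator in one built-in expression
df1 = df1.withColumn('combined_new_1', F.concat_ws('_', df1['name'], df1['mycount']))
df2 = df2.withColumn('combined_new_2', F.concat_ws('_', df2['name2'], df2['max_count']))
final_df = df1.join(df2, df1.combined_new_1 == df2.combined_new_2, 'inner')
final_df.show()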