Why does pyspark throw "'Column' object is not callable" when trying a Window function?

Asked: 2019-06-30 21:41:23

Tags: pyspark

Release label: emr-5.24.0; Hadoop distribution: Amazon 2.8.5; Applications: Spark 2.4.2, Hive 2.3.4

I am trying to get the number of distinct models per year, and to have that count appear in a separate column on each record.

Starting with:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

prod_schema = StructType([
    StructField("model", StringType(), False),
    StructField("year", StringType(), False),
    StructField("price", IntegerType(), False),
    StructField("mileage", IntegerType(), False)])

dumba = [("Galaxy", "2017", 21841, 17529), 
     ("Galaxy", "2017", 29395, 11892), 
     ("Novato", "2018", 35644, 22876), 
     ("Novato", "2017", 28864, 28286), 
     ("Tagur", "2016", 22761, 62551), 
     ("Tagur", "2011", 11952, 104222), 
     ("Tagur", "2017", 30552, 88045), 
     ("Mulion", "2015", 11054, 35644), 
     ("Mulion", "2018", 15275, 43871), 
     ("Mulion", "2016", 10684, 87112)]

df = spark.createDataFrame(dumba, schema=prod_schema)
df.show()

+------+----+-----+-------+
| model|year|price|mileage|
+------+----+-----+-------+
|Galaxy|2017|21841|  17529|
|Galaxy|2017|29395|  11892|
|Novato|2018|35644|  22876|
|Novato|2017|28864|  28286|
| Tagur|2016|22761|  62551|
| Tagur|2011|11952| 104222|
| Tagur|2017|30552|  88045|
|Mulion|2015|11054|  35644|
|Mulion|2018|15275|  43871|
|Mulion|2016|10684|  87112|
+------+----+-----+-------+

I want to end up with:

+------+----+-----+-------+---------------+
| model|year|price|mileage|models_per_year|
+------+----+-----+-------+---------------+
|Galaxy|2017|21841|  17529|              3|
|Galaxy|2017|29395|  11892|              3|
|Novato|2018|35644|  22876|              2|
|Novato|2017|28864|  28286|              3|
| Tagur|2016|22761|  62551|              2|
| Tagur|2011|11952| 104222|              1|
| Tagur|2017|30552|  88045|              3|
|Mulion|2015|11054|  35644|              1|
|Mulion|2018|15275|  43871|              2|
|Mulion|2016|10684|  87112|              2|
+------+----+-----+-------+---------------+

I am getting this error:

Traceback (most recent call last):
File "/home/hadoop/mon/dummy_df.py", line 39, in <module>
df.select(F.col("model").distinct().count())).over(w0)
TypeError: 'Column' object is not callable

when trying to execute the following code:

w0 = Window.partitionBy('year')
df = df.withColumn('models_per_year',
           df.select("model").distinct().count())).over(w0)

I am not sure what this error is trying to tell me or how to fix it, so that I can do this without using groupBy (too expensive). Does anyone have any suggestions?

1 Answer:

Answer 0 (score: 0)

The TypeError itself arises because distinct() and count() are DataFrame methods, not Column methods: F.col("model").distinct resolves to a Column (PySpark interprets the unknown attribute as a nested-field reference), and attempting to call that Column raises "'Column' object is not callable". As for the underlying goal: as far as I know, countDistinct does not currently support window functions, so there is no way to avoid groupBy without sacrificing accuracy. If you can tolerate a small amount of inaccuracy, you should take a look at the approx_count_distinct function:

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

prod_schema = StructType([
    StructField("model", StringType(), False),
    StructField("year", StringType(), False),
    StructField("price", IntegerType(), False),
    StructField("mileage", IntegerType(), False)])

dumba = [("Galaxy", "2017", 21841, 17529), 
     ("Galaxy", "2017", 29395, 11892), 
     ("Novato", "2018", 35644, 22876), 
     ("Novato", "2017", 28864, 28286), 
     ("Tagur", "2016", 22761, 62551), 
     ("Tagur", "2011", 11952, 104222), 
     ("Tagur", "2017", 30552, 88045), 
     ("Mulion", "2015", 11054, 35644), 
     ("Mulion", "2018", 15275, 43871), 
     ("Mulion", "2016", 10684, 87112)]

df = spark.createDataFrame(dumba, schema=prod_schema)

w0 = Window.partitionBy('year')
df = df.withColumn('models_per_year', F.approx_count_distinct('model', 0.02).over(w0))
df.show()

Output:

+------+----+-----+-------+---------------+
| model|year|price|mileage|models_per_year|
+------+----+-----+-------+---------------+
| Tagur|2016|22761|  62551|              2|
|Mulion|2016|10684|  87112|              2|
|Galaxy|2017|21841|  17529|              3|
|Galaxy|2017|29395|  11892|              3|
|Novato|2017|28864|  28286|              3|
| Tagur|2017|30552|  88045|              3|
|Novato|2018|35644|  22876|              2|
|Mulion|2018|15275|  43871|              2|
| Tagur|2011|11952| 104222|              1|
|Mulion|2015|11054|  35644|              1|
+------+----+-----+-------+---------------+