Release label: emr-5.24.0  Hadoop distribution: Amazon 2.8.5  Applications: Spark 2.4.2, Hive 2.3.4
I'm trying to get the count of distinct models per year, and have that count appear in a separate column on every record.
Starting with:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

prod_schema = StructType([
    StructField("model", StringType(), False),
    StructField("year", StringType(), False),
    StructField("price", IntegerType(), False),
    StructField("mileage", IntegerType(), False)
])
dumba = [("Galaxy", "2017", 21841, 17529),
         ("Galaxy", "2017", 29395, 11892),
         ("Novato", "2018", 35644, 22876),
         ("Novato", "2017", 28864, 28286),
         ("Tagur", "2016", 22761, 62551),
         ("Tagur", "2011", 11952, 104222),
         ("Tagur", "2017", 30552, 88045),
         ("Mulion", "2015", 11054, 35644),
         ("Mulion", "2018", 15275, 43871),
         ("Mulion", "2016", 10684, 87112)]
df = spark.createDataFrame(dumba, schema=prod_schema)
df.show()
+------+----+-----+-------+
| model|year|price|mileage|
+------+----+-----+-------+
|Galaxy|2017|21841| 17529|
|Galaxy|2017|29395| 11892|
|Novato|2018|35644| 22876|
|Novato|2017|28864| 28286|
| Tagur|2016|22761| 62551|
| Tagur|2011|11952| 104222|
| Tagur|2017|30552| 88045|
|Mulion|2015|11054| 35644|
|Mulion|2018|15275| 43871|
|Mulion|2016|10684| 87112|
+------+----+-----+-------+
I want to get to:
+------+----+-----+-------+---------------+
| model|year|price|mileage|models_per_year|
+------+----+-----+-------+---------------+
|Galaxy|2017|21841| 17529| 3|
|Galaxy|2017|29395| 11892| 3|
|Novato|2018|35644| 22876| 2|
|Novato|2017|28864| 28286| 3|
| Tagur|2016|22761| 62551| 2|
| Tagur|2011|11952| 104222| 1|
| Tagur|2017|30552| 88045| 3|
|Mulion|2015|11054| 35644| 1|
|Mulion|2018|15275| 43871| 2|
|Mulion|2016|10684| 87112| 2|
+------+----+-----+-------+---------------+
I get this error:
Traceback (most recent call last):
File "/home/hadoop/mon/dummy_df.py", line 39, in <module>
df.select(F.col("model").distinct().count())).over(w0)
TypeError: 'Column' object is not callable
when trying to run the following code:
w0 = Window.partitionBy('year')
df = df.withColumn('models_per_year',
df.select("model").distinct().count())).over(w0)
I'm not sure what the error is trying to tell me or how to fix it so that I can do this without using groupBy (too expensive). Does anyone have suggestions?
Answer 0 (score: 0)
As far as I know, since countDistinct does not currently support window functions, there is no way to avoid groupBy without affecting accuracy. If you can tolerate some inaccuracy, you should take a look at the approx_count_distinct function:
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

prod_schema = StructType([
    StructField("model", StringType(), False),
    StructField("year", StringType(), False),
    StructField("price", IntegerType(), False),
    StructField("mileage", IntegerType(), False)
])
dumba = [("Galaxy", "2017", 21841, 17529),
         ("Galaxy", "2017", 29395, 11892),
         ("Novato", "2018", 35644, 22876),
         ("Novato", "2017", 28864, 28286),
         ("Tagur", "2016", 22761, 62551),
         ("Tagur", "2011", 11952, 104222),
         ("Tagur", "2017", 30552, 88045),
         ("Mulion", "2015", 11054, 35644),
         ("Mulion", "2018", 15275, 43871),
         ("Mulion", "2016", 10684, 87112)]
df = spark.createDataFrame(dumba, schema=prod_schema)
w0 = Window.partitionBy('year')
# the second argument is the maximum allowed relative standard deviation (rsd)
df = df.withColumn('models_per_year', F.approx_count_distinct('model', 0.02).over(w0))
df.show()
Output:
+------+----+-----+-------+---------------+
| model|year|price|mileage|models_per_year|
+------+----+-----+-------+---------------+
| Tagur|2016|22761| 62551| 2|
|Mulion|2016|10684| 87112| 2|
|Galaxy|2017|21841| 17529| 3|
|Galaxy|2017|29395| 11892| 3|
|Novato|2017|28864| 28286| 3|
| Tagur|2017|30552| 88045| 3|
|Novato|2018|35644| 22876| 2|
|Mulion|2018|15275| 43871| 2|
| Tagur|2011|11952| 104222| 1|
|Mulion|2015|11054| 35644| 1|
+------+----+-----+-------+---------------+