Question

I have the following dataset and am using PySpark:

df = sparkSession.createDataFrame([(5, 'Samsung', '2018-02-23'),
                                   (8, 'Apple', '2018-02-22'),
                                   (5, 'Sony', '2018-02-21'),
                                   (5, 'Samsung', '2018-02-20'),
                                   (8, 'LG', '2018-02-20')],
                                   ['ID', 'Product', 'Date']
                                  )

+---+-------+----------+
| ID|Product|      Date|
+---+-------+----------+
|  5|Samsung|2018-02-23|
|  8|  Apple|2018-02-22|
|  5|   Sony|2018-02-21|
|  5|Samsung|2018-02-20|
|  8|     LG|2018-02-20|
+---+-------+----------+
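# Each ID ALWAYS appears at least 2 times (IDs that occur only once do not need to be considered in this df)

Each ID should increment a product's counter only for the product it chose most frequently. If the frequencies are equal, the most recent date decides which product gets the +1.

For the example above, the desired output is: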

+-------+-------+
|Product|Counter|
+-------+-------+
|Samsung|      1|
|  Apple|      1|
|   Sony|      0|
|     LG|      0|
+-------+-------+


# Samsung - 1 (preferred twice by ID=5)
# Apple - 1 (preferred by ID=8 more recently than LG)
# Sony - 0 (because ID=5 preferred Samsung 2 times, and Sony only 1)
# LG - 0 (because ID=8 preferred Apple more recently)

What is the most efficient way to achieve this result with PySpark?

Answer 1

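IIUC, you want to pick the most frequent product for each ID, using the most recent Date as the tie-breaker.

So first, we can get the count for each ID/Product pair using the following: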

import pyspark.sql.functions as f
from pyspark.sql import Window

df = df.select(
    'ID',
    'Product',
    'Date', 
    f.count('Product').over(Window.partitionBy('ID', 'Product')).alias('count')
)
df.show()
#+---+-------+----------+-----+
#| ID|Product|      Date|count|
#+---+-------+----------+-----+
#|  5|   Sony|2018-02-21|    1|
#|  8|     LG|2018-02-20|    1|
#|  8|  Apple|2018-02-22|    1|
#|  5|Samsung|2018-02-23|    2|
#|  5|Samsung|2018-02-20|    2|
#+---+-------+----------+-----+

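Now you can use a Window to rank the products within each ID. We can use pyspark.sql.functions.desc() to sort by count and Date in descending order. If row_number() equals 1, that row is the top-ranked row for its ID.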

w = Window.partitionBy('ID').orderBy(f.desc('count'), f.desc('Date'))
df = df.select(
    'Product',
    (f.row_number().over(w) == 1).cast("int").alias('Counter')
)
df.show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|Samsung|      1|
#|Samsung|      0|
#|   Sony|      0|
#|  Apple|      1|
#|     LG|      0|
#+-------+-------+

Finally, groupBy() Product and take the maximum value of Counter:

df.groupBy('Product').agg(f.max('Counter').alias('Counter')).show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|   Sony|      0|
#|Samsung|      1|
#|     LG|      0|
#|  Apple|      1|
#+-------+-------+
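
For a quick sanity check on small data, the same selection rule can be written without Spark. This is only a minimal sketch, not part of the approach above: preferred_counts is a hypothetical helper, it assumes the rows fit in a Python list and that dates are ISO 'YYYY-MM-DD' strings (so string comparison matches date order), and it assumes a product preferred by several IDs should receive +1 from each of them.

```python
from collections import defaultdict


def preferred_counts(rows):
    """rows: list of (id, product, date) tuples with ISO 'YYYY-MM-DD' dates."""
    # stats[id][product] = [number of rows, most recent date]
    stats = defaultdict(dict)
    for id_, product, date in rows:
        entry = stats[id_].setdefault(product, [0, date])
        entry[0] += 1
        entry[1] = max(entry[1], date)

    # Every product starts at 0 so that non-preferred products still appear.
    counters = {product: 0 for _, product, _ in rows}
    for products in stats.values():
        # Highest count wins; ties go to the most recent date.
        winner = max(products, key=lambda p: (products[p][0], products[p][1]))
        counters[winner] += 1
    return counters


rows = [(5, 'Samsung', '2018-02-23'),
        (8, 'Apple', '2018-02-22'),
        (5, 'Sony', '2018-02-21'),
        (5, 'Samsung', '2018-02-20'),
        (8, 'LG', '2018-02-20')]
print(preferred_counts(rows))
```

On the five sample rows this gives Samsung and Apple a counter of 1 and Sony and LG a counter of 0, matching the table above.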

Update

Here is a simpler way:

# df here is the original DataFrame, before the reassignments above
w = Window.partitionBy('ID').orderBy(f.desc('count'), f.desc('Date'))
df.groupBy('ID', 'Product')\
    .agg(f.max('Date').alias('Date'), f.count('Product').alias('count'))\
    .select('Product', (f.row_number().over(w) == 1).cast("int").alias('Counter'))\
    .show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|Samsung|      1|
#|   Sony|      0|
#|  Apple|      1|
#|     LG|      0|
#+-------+-------+
