I am trying to use Spark (any flavor: PySpark, Spark, Spark SQL, etc.) to compute compound interest (of a sort).
My data looks like this:
+------------+------+------+-------+
| population | rate | year | city  |
+------------+------+------+-------+
| 100        | 0.1  | 1    | one   |
| 100        | 0.11 | 2    | one   |
| 100        | 0.12 | 3    | one   |
| 200        | 0.1  | 1    | two   |
| 1000       | 0.21 | 2    | three |
| 1000       | 0.22 | 3    | three |
+------------+------+------+-------+
The population column is wrong (it comes from a join between two tables, not shown here).
I want to update the population column using the previous row's result, as population * (1 + rate). I know that in SQL I could use a recursive CTE, but HiveQL does not support them.
Can you give me some advice?
Answer 0 (score: 3)
As far as I can tell, all you need is some basic algebra and window functions. First, let's recreate the example data:
import pandas as pd  # Just to make a reproducible example

pdf = pd.DataFrame({
    'city': {0: 'one', 1: 'one', 2: 'one', 3: 'two', 4: 'three', 5: 'three'},
    'population': {0: 100, 1: 100, 2: 100, 3: 200, 4: 1000, 5: 1000},
    'rate': {0: 0.10000000000000001,
             1: 0.11,
             2: 0.12,
             3: 0.10000000000000001,
             4: 0.20999999999999999,
             5: 0.22},
    'year': {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3}})
df = sqlContext.createDataFrame(pdf)
df.show()
## +-----+----------+----+----+
## | city|population|rate|year|
## +-----+----------+----+----+
## | one| 100| 0.1| 1|
## | one| 100|0.11| 2|
## | one| 100|0.12| 3|
## | two| 200| 0.1| 1|
## |three| 1000|0.21| 2|
## |three| 1000|0.22| 3|
## +-----+----------+----+----+
Next, let's define the windows:
import sys

from pyspark.sql.window import Window
from pyspark.sql.functions import exp, log, sum, first, col, coalesce

# Base window
w = Window.partitionBy("city").orderBy("year")
# ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
# (newer Spark versions also accept Window.unboundedPreceding here)
wr = w.rowsBetween(-sys.maxsize, -1)
and some columns:
# Take a sum of logarithms of rates over the window
log_sum = sum(log(col("rate") + 1)).over(wr)
# Take sum of logs and exponentiate to go back to original space
cumulative_rate = exp(log_sum).alias("cumulative_rate")
# Find base population for each group
base_population = first("population").over(w).alias("base_population")
# Prepare final column (base population * cumulative product of rates)
current_population = coalesce(
    # This is null for the first observation in a group
    cumulative_rate * base_population,
    # so we provide population as an alternative
    col("population")
).alias("current_population")
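The key trick above is that Spark has no built-in cumulative-product aggregate, so the product of growth factors is rewritten as exp(sum(log(...))). A minimal plain-Python sketch of this identity (the variable names here are illustrative, not from the original code):

```python
import math

rates = [0.1, 0.11, 0.12]

# Direct cumulative product of the growth factors (1 + rate)
direct = 1.0
for r in rates:
    direct *= 1 + r

# Same product computed as a sum of logs, then exponentiated
via_logs = math.exp(sum(math.log(1 + r) for r in rates))

print(direct, via_logs)  # both are approximately 1.36752
```

Note that this only works because every 1 + rate is strictly positive; a zero or negative factor would make the logarithm undefined.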
Finally, we can put these to use:
df.select("*", current_population).show()
## +-----+----------+----+----+------------------+
## | city|population|rate|year|current_population|
## +-----+----------+----+----+------------------+
## |three| 1000|0.21| 2| 1000.0|
## |three| 1000|0.22| 3| 1210.0|
## | two| 200| 0.1| 1| 200.0|
## | one| 100| 0.1| 1| 100.0|
## | one| 100|0.11| 2|110.00000000000001|
## | one| 100|0.12| 3|122.10000000000004|
## +-----+----------+----+----+------------------+
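As a sanity check (plain Python, not part of the original answer), the values for city one can be reproduced by multiplying the base population by the product of (1 + rate) over all *preceding* years; note that the current row's own rate is not yet applied, which is exactly what the ROWS BETWEEN ... AND 1 PRECEDING frame gives you:

```python
import math

base = 100
rates = [0.1, 0.11, 0.12]  # city "one", years 1..3

populations = []
log_sum = 0.0  # running sum of log(1 + rate) over preceding rows
for i, r in enumerate(rates):
    if i == 0:
        # No preceding rows: fall back to the original population (the coalesce)
        populations.append(float(base))
    else:
        populations.append(base * math.exp(log_sum))
    log_sum += math.log(1 + r)

print(populations)  # approximately [100.0, 110.0, 122.1]
```

Up to floating-point noise, these match the 100, 110.00000000000001, and 122.10000000000004 shown in the output above.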