In Apache Spark 2.0+, how do I find the largest value smaller than a given value in the following problem:
The data frames involved are:
df1
+---+---+
| id| ts|
+---+---+
| 1| 20|
| 2| 15|
+---+---+
df2
+---+---+
| id| ts|
+---+---+
| 1| 10|
| 1| 25|
| 1| 36|
| 2| 25|
| 2| 35|
+---+---+
The question in words: for each id in df1, select the largest ts value in df2 that is smaller than the ts value in df1; if no such value exists, just print the ts value from df1. The desired result is:
+---+---+
| id| ts|
+---+---+
|  1| 10|
|  2| 15|
+---+---+
Answer 0 (score: 1)
Just aggregate, join, and select with when:
from pyspark.sql.functions import col, when, max as max_

df1 = spark.createDataFrame(
    [(1, 20), (2, 15)], ("id", "ts")
)
df2 = spark.createDataFrame(
    [(1, 10), (1, 25), (1, 36), (2, 25), (2, 35)], ("id", "ts")
)

ts = when(
    col("df2.ts") < col("df1.ts"), col("df2.ts")
).otherwise(col("df1.ts")).alias("ts")

(df2
    .groupBy("id")
    .agg(max_("ts").alias("ts")).alias("df2")
    .join(df1.alias("df1"), ["id"])
    .select("id", ts)
    .show())
# +---+---+
# | id| ts|
# +---+---+
# | 1| 20|
# | 2| 15|
# +---+---+
If not all ids have a match in df2, use the equivalent with a right outer join:
.join(df1.alias("df1"), ["id"], "right")
and adjust ts to:
from pyspark.sql.functions import coalesce  # coalesce was not in the import above

ts = coalesce(when(
    col("df2.ts") < col("df1.ts"), col("df2.ts")
).otherwise(col("df1.ts")), col("df1.ts")).alias("ts")