Minimum value in Spark SQL

Asked: 2018-01-29 16:53:08

Tags: apache-spark dataframe join apache-spark-sql spark-dataframe

In Apache Spark 2.0+, how can I find the maximum of a set of minimum values in the following problem?

The dataframes in question are:

df1
+---+---+
| id| ts|
+---+---+
|  1| 20|
|  2| 15|
+---+---+

df2

+---+---+
| id| ts|
+---+---+
|  1| 10|
|  1| 25|
|  1| 36|
|  2| 25|
|  2| 35|
+---+---+

The problem in words: for each id in df1, select the df2 ts value that is smaller than the df1 ts value; if no such value exists, just print the df1 ts value. The expected output is:

+---+---+
| id| ts|
+---+---+
|  1| 10|
|  2| 15|
+---+---+

1 answer:

Answer 0 (score: 1)

Just aggregate, join, and then select with when:

# coalesce is needed for the right-outer-join variant further down
from pyspark.sql.functions import coalesce, col, when, max as max_

df1 = spark.createDataFrame(
    [(1, 20), (2, 15)], ("id", "ts")
)
df2 = spark.createDataFrame(
    [(1, 10), (1, 25), (1, 36), (2, 25), (2, 35)], ("id", "ts")
)

ts = when(
    col("df2.ts") < col("df1.ts"), col("df2.ts")
).otherwise(col("df1.ts")).alias("ts")

(df2
    .groupBy("id")
    .agg(max_("ts").alias("ts")).alias("df2")
    .join(df1.alias("df1"), ["id"])
    .select("id", ts)
    .show())

# +---+---+                                                                       
# | id| ts|
# +---+---+
# |  1| 20|
# |  2| 15|
# +---+---+

If not every id in df1 has a counterpart in df2, use a right outer join:

.join(df1.alias("df1"), ["id"], "right")

and adjust ts to:

ts = coalesce(when(
    col("df2.ts") < col("df1.ts"), col("df2.ts")
).otherwise(col("df1.ts")), col("df1.ts")).alias("ts")