Find the nearest time between two tables in Spark

Date: 2016-07-27 21:50:46

Tags: sql apache-spark pyspark apache-spark-sql

I am using pyspark, and I have two DataFrames like this:

user         time          bus
 A    2016/07/18 12:00:00   1
 B    2016/07/19 12:00:00   2
 C    2016/07/20 12:00:00   3

bus          time          stop
 1    2016/07/18 11:59:40   sA
 1    2016/07/18 11:59:50   sB
 1    2016/07/18 12:00:05   sC
 2    2016/07/19 11:59:40   sB
 2    2016/07/19 12:00:10   sC
 3    2016/07/20 11:59:55   sD
 3    2016/07/20 12:00:10   sE

Now I want to find the stop each user reported from, by matching the bus number and the closest time in the second table.

For example, in table 1, user A reported at 2016/07/18 12:00:00 while on bus 1. According to the second table, there are three records for bus 1, but the closest time is 2016/07/18 12:00:05 (the third record), so the user is now at sC.

The desired output should look like this:

user         time          bus  stop
 A    2016/07/18 12:00:00   1    sC
 B    2016/07/19 12:00:00   2    sC
 C    2016/07/20 12:00:00   3    sD

I have already converted the times into timestamps, so the only problem is finding the nearest timestamp among the rows with an equal bus number.
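For reference, a conversion like that can be done with unix_timestamp; a minimal sketch (the DataFrame name df is a placeholder, not from the question):

from pyspark.sql.functions import col, unix_timestamp

#df is a placeholder DataFrame with a string "time" column;
#the pattern must match the yyyy/MM/dd HH:mm:ss layout of the sample data
df = df.withColumn(
    "time", unix_timestamp(col("time"), "yyyy/MM/dd HH:mm:ss").cast("timestamp"))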

Because I am not familiar with SQL yet, I tried to use a map function to find the closest time and its stop, which means I would have to call sqlContext.sql inside the map function, and Spark doesn't seem to allow this:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
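For illustration, the failing pattern looks roughly like this (a hypothetical sketch with illustrative names, not the original snippet):

#this fails: sqlContext lives on the driver, and referencing it inside
#a transformation that is shipped to the workers raises SPARK-5063
nearest_stops = user_df.rdd.map(
    lambda row: sqlContext.sql(
        "SELECT stop FROM bustime WHERE bus = {0}".format(row.bus)).first())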

So how can I write a SQL query to get the right output?

1 Answer:

Answer 0 (score: 6):

This can be done using window functions.


from datetime import datetime

from pyspark.sql import Row, functions as F
from pyspark.sql.functions import col, unix_timestamp
from pyspark.sql.window import Window

#parse the time strings used in the sample data
def tm(time_str):
    return datetime.strptime(time_str, "%Y/%m/%d %H:%M:%S")

#setup data
userTime = [ Row(user="A",time=tm("2016/07/18 12:00:00"),bus = 1) ]
userTime.append(Row(user="B",time=tm("2016/07/19 12:00:00"),bus = 2))
userTime.append(Row(user="C",time=tm("2016/07/20 12:00:00"),bus = 3))

busTime = [ Row(bus=1,time=tm("2016/07/18 11:59:40"),stop = "sA") ]
busTime.append(Row(bus=1,time=tm("2016/07/18 11:59:50"),stop = "sB"))
busTime.append(Row(bus=1,time=tm("2016/07/18 12:00:05"),stop = "sC"))
busTime.append(Row(bus=2,time=tm("2016/07/19 11:59:40"),stop = "sB"))
busTime.append(Row(bus=2,time=tm("2016/07/19 12:00:10"),stop = "sC"))
busTime.append(Row(bus=3,time=tm("2016/07/20 11:59:55"),stop = "sD"))
busTime.append(Row(bus=3,time=tm("2016/07/20 12:00:10"),stop = "sE"))

#create DataFrames (via an intermediate RDD)
userDf = sc.parallelize(userTime).toDF().alias("usertime")
busDf = sc.parallelize(busTime).toDF().alias("bustime")

joinedDF = userDf.join(busDf,col("usertime.bus") == col("bustime.bus"),"inner").select(
    userDf.user,
    userDf.time.alias("user_time"),
    busDf.bus,
    busDf.time.alias("bus_time"),
    busDf.stop)

additional_cols = joinedDF.withColumn(
    "bus_time_diff",
    F.abs(unix_timestamp(col("bus_time")) - unix_timestamp(col("user_time"))))

partDf = additional_cols.select(
    "user", "user_time", "bus", "bus_time", "stop", "bus_time_diff",
    F.rowNumber()  # renamed to F.row_number() in Spark 2.x
     .over(Window.partitionBy("user", "bus").orderBy("bus_time_diff"))
     .alias("rank")
).filter(col("rank") == 1)


additional_cols.show(20,False)
partDf.show(20,False)
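Since the question asks for a SQL query, the same logic can also be written directly in Spark SQL with a window function. A rough sketch, assuming window-function support is available in SQL (a HiveContext on Spark 1.x) and using illustrative table names:

userDf.registerTempTable("usertime_t")
busDf.registerTempTable("bustime_t")

nearest = sqlContext.sql("""
    SELECT user, user_time, bus, stop
    FROM (
        SELECT u.user, u.time AS user_time, u.bus, b.stop,
               ROW_NUMBER() OVER (
                   PARTITION BY u.user, u.bus
                   ORDER BY ABS(unix_timestamp(b.time) - unix_timestamp(u.time))
               ) AS rn
        FROM usertime_t u
        JOIN bustime_t b ON u.bus = b.bus
    ) ranked
    WHERE rn = 1
""")
nearest.show()

With the sample data above, both versions should return sC, sC, and sD for users A, B, and C, matching the desired output.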