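I have 2 main tables: flights and holidays.
Flights are identified by: outboundlegid, inboundlegid, agent, querydatetime. The other columns relevant to this question are out_date and in_date, which give the departure date and the return date of the flight.
The holidays columns are start, end, type.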
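I want to determine whether a flight's outbound/return dates intersect with any entry in the holidays table.
I followed some of the advice from "PySpark: How to add columns whose data come from a query (similar to subquery for each row)" to work out whether the out/in dates intersect any holiday.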
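To make the goal concrete, here is a minimal plain-Python sketch (not Spark) of the four flags wanted per flight. The sample rows and the helper names `falls_in` and `flag_flight` are made up for illustration; only the column names and the `HOLIDAY` / `LONG_WEEKENDS` types come from the tables described above.

```python
from datetime import date

# Hypothetical sample rows; only the column names come from the question.
holidays = [
    {"start": date(2019, 2, 5), "end": date(2019, 2, 6), "type": "HOLIDAY"},
    {"start": date(2019, 4, 19), "end": date(2019, 4, 22), "type": "LONG_WEEKENDS"},
]

flights = [
    {"outboundlegid": 1, "inboundlegid": 2, "agent": "a",
     "querydatetime": date(2019, 1, 1),
     "out_date": date(2019, 2, 5), "in_date": date(2019, 2, 10)},
    {"outboundlegid": 3, "inboundlegid": 4, "agent": "b",
     "querydatetime": date(2019, 1, 1),
     "out_date": date(2019, 4, 18), "in_date": date(2019, 4, 21)},
]


def falls_in(day, kind, holiday_rows):
    """True if `day` lies inside any range of type `kind` (inclusive, like SQL BETWEEN)."""
    return any(
        h["type"] == kind and h["start"] <= day <= h["end"]
        for h in holiday_rows
    )


def flag_flight(flight, holiday_rows):
    """Return a copy of the flight row with the four boolean flags added."""
    flags = {
        "out_is_holiday": falls_in(flight["out_date"], "HOLIDAY", holiday_rows),
        "out_is_longweekends": falls_in(flight["out_date"], "LONG_WEEKENDS", holiday_rows),
        "in_is_holiday": falls_in(flight["in_date"], "HOLIDAY", holiday_rows),
        "in_is_longweekends": falls_in(flight["in_date"], "LONG_WEEKENDS", holiday_rows),
    }
    return {**flight, **flags}


flagged = [flag_flight(f, holidays) for f in flights]
```

Each flight keeps its original columns and gains one boolean per date/type combination, which is the shape the query is supposed to produce.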
However, running my query fails with: "pyspark.sql.utils.ParseException: u"\nextraneous input 'outboundlegid' expecting {')', ','}(line 35, pos 12)". What is going on?
File "script_2019-02-08-10-46-14.py", line 182, in <module>
  """)
File "/mnt/yarn/usercache/root/appcache/application_1549622095592_0002/container_1549622095592_0002_01_000001/pyspark.zip/pyspark/sql/session.py", line 603, in sql
File "/mnt/yarn/usercache/root/appcache/application_1549622095592_0002/container_1549622095592_0002_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1549622095592_0002/container_1549622095592_0002_01_000001/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
pyspark.sql.utils.ParseException: u"\nextraneous input 'outboundlegid' expecting {')', ','}(line 35, pos 12)\n\n== SQL ==\n\nWITH t (\nSELECT\nf.outboundlegid,\nf.inboundlegid,\nf.agent,\nf.querydatetime,\nCASE WHEN type = 'HOLIDAY' AND (out_date BETWEEN start AND end)\nTHEN true\nELSE false\nEND out_is_holiday,\nCASE WHEN type = 'LONG_WEEKENDS' AND (out_date BETWEEN start AND end)\nTHEN true\nELSE false\nEND out_is_longweekends,\nCASE WHEN type = 'HOLIDAY' AND (in_date BETWEEN start AND end)\nTHEN true\nELSE false\nEND in_is_holiday,\nCASE WHEN type = 'LONG_WEEKENDS' AND (in_date BETWEEN start AND end)\nTHEN true\nELSE false\nEND in_is_longweekends\nFROM flights f\nCROSS JOIN holidays h\n)\nSELECT\nf.*,\nt1.out_is_holiday,\nt1.out_is_longweekends,\nt1.in_is_holiday,\nt1.in_is_longweekends,\nFROM (\nSELECT\noutboundlegid,\n------------^^^\ninboundlegid,\nagent,\nquerydatetime,\nCASE WHEN array_contains(collect_set(out_is_holiday), true)\nTHEN true\nELSE false\nEND out_is_holiday,\nCASE WHEN array_contains(collect_set(out_is_longweekends), true)\nTHEN true\nELSE false\nEND out_is_longweekends,\nCASE WHEN array_contains(collect_set(in_is_holiday), true)\nTHEN true\nELSE false\nEND in_is_holiday,\nCASE WHEN array_contains(collect_set(in_is_longweekends), true)\nTHEN true\nELSE false\n
What is wrong with this query?
resultDf = spark.sql("""
    WITH t (
        SELECT
            f.outboundlegid,
            f.inboundlegid,
            f.agent,
            f.querydatetime,
            CASE WHEN type = 'HOLIDAY' AND (out_date BETWEEN start AND end)
                THEN true
                ELSE false
            END out_is_holiday,
            CASE WHEN type = 'LONG_WEEKENDS' AND (out_date BETWEEN start AND end)
                THEN true
                ELSE false
            END out_is_longweekends,
            CASE WHEN type = 'HOLIDAY' AND (in_date BETWEEN start AND end)
                THEN true
                ELSE false
            END in_is_holiday,
            CASE WHEN type = 'LONG_WEEKENDS' AND (in_date BETWEEN start AND end)
                THEN true
                ELSE false
            END in_is_longweekends
        FROM flights f
        CROSS JOIN holidays h
    )
    SELECT
        f.*,
        t1.out_is_holiday,
        t1.out_is_longweekends,
        t1.in_is_holiday,
        t1.in_is_longweekends,
    FROM (
        SELECT
            outboundlegid, # <<< I am guessing something wrong with this? But Why?
            inboundlegid,
            agent,
            querydatetime,
            CASE WHEN array_contains(collect_set(out_is_holiday), true)
                THEN true
                ELSE false
            END out_is_holiday,
            CASE WHEN array_contains(collect_set(out_is_longweekends), true)
                THEN true
                ELSE false
            END out_is_longweekends,
            CASE WHEN array_contains(collect_set(in_is_holiday), true)
                THEN true
                ELSE false
            END in_is_holiday,
            CASE WHEN array_contains(collect_set(in_is_longweekends), true)
                THEN true
                ELSE false
            END in_is_longweekends
        FROM t
        GROUP BY
            querydatetime,
            outboundlegid,
            inboundlegid,
            agent
        LIMIT 100000
    ) t1
    INNER JOIN flights f
        ON t1.querydatetime = f.querydatetime
        AND t1.outboundlegid = f.outboundlegid
        AND t1.inboundlegid = f.inboundlegid
        AND t1.agent = f.agent
    INNER JOIN agents a
        ON f.agent = a.id
    INNER JOIN airports p
        ON f.querydestinationplace = p.airportId
""")