Question

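I am using pyspark to query a SQL table.

I have a SQL table with two columns (value, isDelayed), where value is a double and isDelayed is either 0 or 1. How can I write a pyspark aggregate query that returns the sum of value over the rows where isDelayed is 1?

I tried the code below, which raises an error: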

def __main__(self, data):
    delayedData = data.where(col('isDelayed').cast('int')==='1')
    groupByIsDelayed = delayedData.agg(sum(total))
    return groupByIsDelayed

I get

"SyntaxError: invalid syntax"

on the following line:

delayedData = data.where(col('isDelayed').cast('int')==='1')

Answer 1

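Replace data.where(col('isDelayed').cast('int')==='1') with data.where(col('isDelayed').cast('int') == 1):

Use only two = signs (the equality operator in Python is ==; === is not valid Python syntax).
Write 1 without quotes (because you are comparing against an integer, not a string).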

Or, pass the condition as a SQL expression string:

data.where("isDelayed=1")
