CASE statement with an IN clause in PySpark

Asked: 2016-04-26 19:33:36

Tags: apache-spark pyspark pyspark-sql

I am new to Spark programming and have a scenario where I need to assign a value whenever one of a given set of values appears in the input. Below is the traditional SQL code I would use to accomplish the task; I need to do the same thing in Spark.

SQL code:

SELECT CASE WHEN c.Number IN ('1121231', '31242323') THEN 1
       ELSE 2 END AS Test
FROM   Input c

I know how to use when in Spark with only a single condition:

Input.select(when(Input.Number==1121231,1).otherwise(2).alias("Test")).show()

1 Answer:

Answer 0 (score: 6)

I assume you are working with Spark DataFrames, not RDDs. One thing to note is that you can run SQL queries directly against a DataFrame:

# register the DataFrame so we can refer to it in queries
sqlContext.registerDataFrameAsTable(df, "df")

# put your SQL query in a string
query = """SELECT CASE WHEN 
    df.number IN ('1121231', '31242323') THEN 1 ELSE 2 END AS test 
    FROM df"""

result = sqlContext.sql(query)
result.show()

You can also use select with a user-defined function that mimics your query's CASE statement:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# need to pass inner function through udf() so it can operate on Columns
# also need to specify return type
column_in_list = udf(
    lambda column: 1 if column in ['1121231', '31242323'] else 2, 
    IntegerType()
)

# call function on column, name resulting column "transformed"
result = df.select(column_in_list(df.number).alias("transformed"))
result.show()
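Because the function passed to udf() is a plain Python callable, you can sanity-check its logic locally before handing it to Spark — a quick check of the same lambda:

```python
# the same logic as the lambda passed to udf(), tested without Spark
column_in_list = lambda column: 1 if column in ['1121231', '31242323'] else 2

assert column_in_list('1121231') == 1   # value in the list -> 1
assert column_in_list('31242323') == 1
assert column_in_list('99999') == 2     # value not in the list -> 2
```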