Pyspark UDF to round one column to the precision specified by another column

Date: 2018-10-08 10:15:22

Tags: apache-spark pyspark user-defined-functions

I am trying to create a UDF in pyspark that rounds the value in one column to the precision specified by another column in the same row. For example, given the following dataframe:

+--------+--------+
|    Data|Rounding|
+--------+--------+
|3.141592|       3|
|0.577215|       1|
+--------+--------+

applying the UDF should give the following result:

+--------+--------+--------------+
|    Data|Rounding|Rounded Column|
+--------+--------+--------------+
|3.141592|       3|         3.142|
|0.577215|       1|           0.6|
+--------+--------+--------------+

Specifically, I tried the following code, which fails with an error:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, LongType, IntegerType

pdDF = pd.DataFrame(columns=["Data", "Rounding"],
                    data=[[3.141592, 3], [0.577215, 1]])

mySchema = StructType([StructField("Data", FloatType(), True),
                       StructField("Rounding", IntegerType(), True)])

spark = SparkSession.builder.master("local").appName("column rounding").getOrCreate()

df = spark.createDataFrame(pdDF, schema=mySchema)

df.show()

def round_column(Data, Rounding):
    return (lambda (Data, Rounding): round(Data, Rounding), FloatType())

spark.udf.register("column rounded to the precision specified by another",
                   round_column, FloatType())

df_rounded = df.withColumn('Rounded Column', round_column(df["Data"], df["Rounding"]))

df_rounded.show()

Any help would be greatly appreciated :)

3 Answers:

Answer 0 (score: 1)

Your code fails because round_column is not a valid udf. You should use:

from pyspark.sql.functions import udf

@udf(FloatType())
def round_column(data, rounding):
    return round(data, rounding)
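To see concretely why the original round_column is rejected, note that it returns a (function, type) tuple rather than a rounded float. Below is a minimal plain-Python sketch (Python 3 syntax, with a hypothetical placeholder standing in for pyspark's FloatType(), so no Spark installation is needed):

```python
FLOAT_TYPE = "FloatType()"  # hypothetical stand-in for pyspark's FloatType()

def round_column(Data, Rounding):
    # Python 3 rewrite of the original: it returns a (lambda, type) tuple,
    # not the rounded value itself, so Spark cannot use it as a UDF.
    return (lambda data, rounding: round(data, rounding), FLOAT_TYPE)

result = round_column(3.141592, 3)
print(type(result).__name__)  # tuple -- not a float
```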

spark.udf.register is used to register functions that are called from SQL queries, so it does not apply here.

However, you do not need a udf at all. Just use:

from pyspark.sql.functions import expr

df_rounded = df.withColumn('Rounded Column', expr("round(Data, Rounding)"))
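One caveat not stated in the answer above: Spark SQL's round() rounds halves away from zero (HALF_UP), while Python's built-in round, which the udf-based approaches rely on, uses banker's rounding (half to even), so the two can differ on exact ties. Plain Python illustrates the difference, with Decimal used to mimic HALF_UP:

```python
from decimal import Decimal, ROUND_HALF_UP

# Python's built-in round uses banker's rounding (ties go to the even digit).
# 0.125 is exactly representable in binary, so this is a true tie.
print(round(0.125, 2))  # 0.12

# HALF_UP (round half away from zero), the mode Spark SQL's round() uses:
print(Decimal("0.125").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 0.13
```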

Answer 1 (score: 1)

As mentioned in the other answer, your udf is invalid.

You can use an inline udf as follows:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import FloatType

udf_round_column = udf(lambda row: round(row['data'], row['rounding']), FloatType())
df_rounded = df.withColumn('rounded_col', udf_round_column(struct('data', 'rounding')))

Or as a separate function:

def round_column(data, rounding):
    return round(data, rounding)

udf_round_column = udf(round_column, FloatType())
df_rounded = df.withColumn('rounded_col', udf_round_column('data', 'rounding'))

Both return this:

+---+---------+--------+-----------+
| id|     data|rounding|rounded_col|
+---+---------+--------+-----------+
|  1|3.1415926|       3|      3.142|
|  2|  0.12345|       6|    0.12345|
|  3|   2.3456|       1|        2.3|
+---+---------+--------+-----------+
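Since the per-row logic here is just Python's built-in round, the rounded_col values in the table above can be sanity-checked without Spark:

```python
# Rows from the example dataframe above: (data, rounding) pairs.
rows = [(3.1415926, 3), (0.12345, 6), (2.3456, 1)]

# The same per-row computation the udf performs.
rounded = [round(data, rounding) for data, rounding in rows]
print(rounded)  # [3.142, 0.12345, 2.3]
```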

Answer 2 (score: 0)

If you want to apply the UDF to a dataframe, you can simply import it with

from pyspark.sql.functions import udf

and use it like this:

round_column_udf = udf(round_column, FloatType())
df_rounded = df.withColumn('Rounded_Column', round_column_udf(df['Data'], df['Rounding']))

Registering the udf lets you use it in Spark SQL queries:

spark.udf.register("round_column_udf", round_column, FloatType())
df.registerTempTable("df")
spark.sql("select Data, Rounding, round_column_udf(Data, Rounding) as Rounded_Column from df").show()

Both approaches should work.