I am trying to create a UDF in PySpark that rounds the value in one column to the precision given by another column in the same row, e.g. for the following dataframe:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
When passed to the UDF, it should produce the following result:
+--------+--------+--------------+
| Data|Rounding|Rounded Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
Specifically, I tried the following code:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, LongType, IntegerType

pdDF = pd.DataFrame(columns=["Data", "Rounding"], data=[[3.141592, 3], [0.577215, 1]])

mySchema = StructType([StructField("Data", FloatType(), True),
                       StructField("Rounding", IntegerType(), True)])

spark = SparkSession.builder.master("local").appName("column rounding").getOrCreate()

df = spark.createDataFrame(pdDF, schema=mySchema)
df.show()

def round_column(Data, Rounding):
    return (lambda (Data, Rounding): round(Data, Rounding), FloatType())

spark.udf.register("column rounded to the precision specified by another",
                   round_column, FloatType())

df_rounded = df.withColumn('Rounded Column', round_column(df["Data"], df["Rounding"]))
df_rounded.show()
But this code fails with an error. Any help would be greatly appreciated :)
Answer 0 (score: 1)
Your code fails because round_column is not a valid udf. You should use:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

@udf(FloatType())
def round_column(data, rounding):
    return round(data, rounding)
spark.udf.register is for registering functions that are called from SQL queries, so it does not apply here.
You don't need a udf at all, though. Just use:
from pyspark.sql.functions import expr
df_rounded = df.withColumn('Rounded Column', expr('round(Data, Rounding)'))
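The reason a SQL expression is used here (an assumption worth checking against your Spark version): pyspark.sql.functions.round expects its scale argument as a plain Python int rather than a Column, so a per-row precision has to go through expr or a udf. For the question's values, the expression computes the same thing as this plain-Python sketch:

```python
# Plain-Python sketch of what round(Data, Rounding) computes per row,
# using the example values from the question.
rows = [(3.141592, 3), (0.577215, 1)]
rounded = [round(data, prec) for data, prec in rows]
print(rounded)  # [3.142, 0.6]
```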
Answer 1 (score: 1)
As mentioned in the other answer, your udf is not valid. You can use an inline udf as follows:
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import FloatType

udf_round_column = udf(lambda row: round(row['data'], row['rounding']), FloatType())
df_rounded = df.withColumn('rounded_col', udf_round_column(struct('data', 'rounding')))
Or as a separate function:
def round_column(data, rounding):
    return round(data, rounding)

udf_round_column = udf(round_column, FloatType())
df_rounded = df.withColumn('rounded_col', udf_round_column('data', 'rounding'))
Both return this:
+---+---------+--------+-----------+
| id| data|rounding|rounded_col|
+---+---------+--------+-----------+
| 1|3.1415926| 3| 3.142|
| 2| 0.12345| 6| 0.12345|
| 3| 2.3456| 1| 2.3|
+---+---------+--------+-----------+
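One caveat with the Python-based udfs above (a property of CPython's built-in round, not of Spark, and worth verifying on your Spark version): Python rounds exact halves to the nearest even digit ("banker's rounding"), whereas Spark SQL's round rounds halves away from zero, so a udf and the built-in SQL round can disagree on exact ties:

```python
# CPython's round() uses round-half-to-even ("banker's rounding").
print(round(2.5))      # 2, not 3
print(round(3.5))      # 4
print(round(0.25, 1))  # 0.2 -- a half-away-from-zero rule would give 0.3
```

None of the example values in the question hit such a tie, so both approaches agree there.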
Answer 2 (score: 0)
If you want to apply the UDF to a dataframe, simply import it with
from pyspark.sql.functions import udf
and use it like this:
round_column_udf = udf(round_column, FloatType())
df_rounded = df.withColumn('Rounded_Column', round_column_udf(df['Data'], df['Rounding']))
Registering the udf makes it available in Spark SQL queries (note: registerTempTable is deprecated in Spark 2.0+; createOrReplaceTempView is the modern equivalent):
spark.udf.register("round_column_udf",round_column, FloatType())
df.registerTempTable("df")
spark.sql("select Data, Rounding, round_column_udf(Data, Rounding) as Rounded_Column from df").show()
Both approaches should work.