At the moment I am experimenting with pyspark's pandas_udf, but unfortunately I run into problems whenever the DataFrame I return contains NA, None or NaN values. With FloatType the result is fine, but as soon as I use IntegerType, TimestampType, etc. I get an error and it no longer works.
Below are some examples of what works and what doesn't:
What does work?
Example 1)
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

custom_schema = StructType([
    StructField('User', StringType(), True),
    StructField('Sport', StringType(), True),
    StructField('Age', IntegerType(), True),
    StructField('Age_lag', FloatType(), True),
])

# the schema is what it needs as an output format
@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    # Input/output are both a pandas.DataFrame
    # return a totally different DataFrame
    dt = pd.DataFrame({'User': ['Alice', 'Bob'], 'Sport': ['Football', 'Basketball'], 'Age': [27, 34]})
    dt['Age_lag'] = dt['Age'].shift(1)
    return dt
df.groupby('id').apply(my_custom_function).toPandas()
Result:
User Sport Age Age_lag
0 Alice Football 27 NaN
1 Bob Basketball 34 27.0
2 Alice Football 27 NaN
3 Bob Basketball 34 27.0
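I think I understand why this one works: shift(1) introduces a NaN, which forces pandas to upcast the column to float64, and that matches the FloatType I declared for Age_lag. A quick local check of my assumption (plain pandas, outside Spark):

import pandas as pd

dt = pd.DataFrame({'Age': [27, 34]})
print(dt['Age'].dtype)           # int64
print(dt['Age'].shift(1).dtype)  # float64 - NaN can only live in a float column

So the NaN comes back untouched in the result above, because a nullable float column can represent it.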
Example 2)
If we change the type of Age_lag to IntegerType() and fill the NA with -1, we still get a valid result (without NaN):
custom_schema = StructType([
    StructField('User', StringType(), True),
    StructField('Sport', StringType(), True),
    StructField('Age', IntegerType(), True),
    StructField('Age_lag', IntegerType(), True),
])

# the schema is what it needs as an output format
@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    # Input/output are both a pandas.DataFrame
    # return a totally different DataFrame
    dt = pd.DataFrame({'User': ['Alice', 'Bob'], 'Sport': ['Football', 'Basketball'], 'Age': [27, 34]})
    dt['Age_lag'] = dt['Age'].shift(1).fillna(-1)
    return dt
df.groupby('id').apply(my_custom_function).toPandas()
Result:
User Sport Age Age_lag
0 Alice Football 27 -1
1 Bob Basketball 34 27
2 Alice Football 27 -1
3 Bob Basketball 34 27
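If I check this version locally, the column is still float64 after fillna(-1) (I assume fillna does not downcast by default), but since every value is now a whole number and there is no NaN, the conversion to IntegerType apparently goes through without loss:

import pandas as pd

dt = pd.DataFrame({'Age': [27, 34]})
s = dt['Age'].shift(1).fillna(-1)
print(s.dtype)     # float64
print(s.tolist())  # [-1.0, 27.0] - integral values, so a cast to int is lossless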
What does not work?
Example 3)
If we omit the .fillna(-1), we get the following error:
custom_schema = StructType([
    StructField('User', StringType(), True),
    StructField('Sport', StringType(), True),
    StructField('Age', IntegerType(), True),
    StructField('Age_lag', IntegerType(), True),
])

# the schema is what it needs as an output format
@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    # Input/output are both a pandas.DataFrame
    # return a totally different DataFrame
    dt = pd.DataFrame({'User': ['Alice', 'Bob'], 'Sport': ['Football', 'Basketball'], 'Age': [27, 34]})
    dt['Age_lag'] = dt['Age'].shift(1)
    return dt
df.groupby('id').apply(my_custom_function).toPandas()
Result: pyarrow.lib.ArrowInvalid: Floating point value truncated
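As far as I can tell, the problem is that my UDF returns a float64 column containing NaN while the schema says IntegerType, and the Arrow conversion refuses to truncate it. A minimal sketch outside Spark, assuming the worker does an Arrow-level conversion of the pandas column to int32 (which is what the error suggests):

import numpy as np
import pyarrow as pa

col = np.array([np.nan, 27.0])   # what shift(1) hands back: float64 with a NaN
pa.array(col, type=pa.int32())   # I expect this to raise pyarrow.lib.ArrowInvalid,
                                 # since NaN has no int32 representation

The exact wording of the error seems to differ between pyarrow versions, but it fails for the same reason.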
Example 4)
Last but not least, if we just send back a static DataFrame in which Age_lag contains None, it does not work either.
from pyspark.sql.types import StructType, NullType, StructField, FloatType, LongType, DoubleType, StringType, IntegerType

# true means: accepts nulls
custom_schema = StructType([
    StructField('User', StringType(), True),
    StructField('Sport', StringType(), True),
    StructField('Age', IntegerType(), True),
    StructField('Age_lag', IntegerType(), True),
])

# the schema is what it needs as an output format
@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    # Input/output are both a pandas.DataFrame
    # return a totally different DataFrame
    dt = pd.DataFrame({'User': ['Alice', 'Bob'],
                       'Sport': ['Football', 'Basketball'],
                       'Age': [27, 34],
                       'Age_lag': [27, None]})
    return dt
df.groupby('id').apply(my_custom_function).toPandas()
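I suspect this fails for the same reason as example 3: pandas coerces the None into NaN and gives the column a float64 dtype, so once again a float column with a missing value meets an IntegerType schema. A quick check of that assumption:

import pandas as pd

dt = pd.DataFrame({'Age_lag': [27, None]})
print(dt['Age_lag'].dtype)   # float64 - None is coerced to NaN in a numeric column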
Question: