PySpark pyarrow pandas_udf (GROUPED_MAP): returning a DataFrame with NaN/None fails for IntegerType, TimestampType

Asked: 2018-12-03 09:57:34

Tags: python pandas apache-spark pyspark pyarrow


I am currently experimenting with PySpark's pandas_udf, but unfortunately I run into problems whenever I return a DataFrame that contains NA, None or NaN. With FloatType the result is fine, but as soon as I use IntegerType, TimestampType, etc., I get an error and it no longer works.

Below are some examples of what works and what does not:
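All of the examples assume a SparkSession and a Spark DataFrame df with an id column to group on. The original df is not shown in the question, so the setup here is only a minimal, hypothetical sketch:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

spark = SparkSession.builder.appName('pandas_udf_nan_example').getOrCreate()

# hypothetical input frame -- only the 'id' column matters, since the UDFs
# below ignore their input and return a freshly built pandas DataFrame
df = spark.createDataFrame([(1,), (2,)], ['id'])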

What does work?
Example 1)

custom_schema = StructType([
                        StructField('User',StringType(),True),
                        StructField('Sport',StringType(),True),
                        StructField('Age',IntegerType(),True),
                        StructField('Age_lag',FloatType(),True),
                        ])

# the schema defines the required output format
@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    # Input/output are both a pandas.DataFrame

    # return a totally different DataFrame
    dt = pd.DataFrame({'User': ['Alice', 'Bob'], 'Sport': ['Football', 'Basketball'], 'Age': [27, 34]})
    dt['Age_lag'] = dt['Age'].shift(1)

    return dt

df.groupby('id').apply(my_custom_function).toPandas() 

Result:

    User    Sport   Age     Age_lag
0   Alice   Football    27  NaN
1   Bob     Basketball  34  27.0
2   Alice   Football    27  NaN
3   Bob     Basketball  34  27.0

Example 2)

If we change the type of Age_lag to IntegerType() and fill the NAs with -1, we still get a valid result (without NaN):

custom_schema = StructType([
                        StructField('User',StringType(),True),
                        StructField('Sport',StringType(),True),
                        StructField('Age',IntegerType(),True),
                        StructField('Age_lag',IntegerType(),True),
                        ])

# the schema defines the required output format
@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    # Input/output are both a pandas.DataFrame

    # return a totally different DataFrame
    dt = pd.DataFrame({'User': ['Alice', 'Bob'], 'Sport': ['Football', 'Basketball'], 'Age': [27, 34]})
    dt['Age_lag'] = dt['Age'].shift(1).fillna(-1)

    return dt

df.groupby('id').apply(my_custom_function).toPandas() 

Result:

    User    Sport   Age     Age_lag
0   Alice   Football    27  -1
1   Bob     Basketball  34  27
2   Alice   Football    27  -1
3   Bob     Basketball  34  27
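
If the -1 is only a sentinel to get the data through Arrow, it still has to be turned back into a proper null on the Spark side afterwards. A sketch of that "put them back" step, assuming -1 can never occur as a real lag value:

from pyspark.sql.functions import col, when

result = df.groupby('id').apply(my_custom_function)
# replace the sentinel with a real null again
result = result.withColumn('Age_lag',
                           when(col('Age_lag') == -1, None).otherwise(col('Age_lag')))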



What does not work?

Example 3)

If we omit the .fillna(-1), we get the following error:

custom_schema = StructType([
                        StructField('User',StringType(),True),
                        StructField('Sport',StringType(),True),
                        StructField('Age',IntegerType(),True),
                        StructField('Age_lag',IntegerType(),True),
                        ])

# the schema defines the required output format
@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    # Input/output are both a pandas.DataFrame

    # return a totally different DataFrame
    dt = pd.DataFrame({'User': ['Alice', 'Bob'], 'Sport': ['Football', 'Basketball'], 'Age': [27, 34]})
    dt['Age_lag'] = dt['Age'].shift(1)

    return dt

df.groupby('id').apply(my_custom_function).toPandas() 

Result: pyarrow.lib.ArrowInvalid: Floating point value truncated
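
The underlying issue is already visible in plain pandas: shift() introduces NaN, and a plain integer dtype cannot hold NaN, so the lagged column is upcast to float64. Arrow then has to fit that float column into the declared IntegerType, which is the float-to-integer conversion the error message refers to:

import pandas as pd

s = pd.Series([27, 34])
print(s.dtype)           # int64
print(s.shift(1).dtype)  # float64 -- NaN forces the upcast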



Example 4)

Last but not least, if we simply return a static DataFrame in which Age_lag contains a None, it does not work either.

from pyspark.sql.types import StructType, NullType, StructField, FloatType, LongType, DoubleType, StringType, IntegerType
# True means the field accepts nulls
custom_schema = StructType([
                        StructField('User',StringType(),True),
                        StructField('Sport',StringType(),True),
                        StructField('Age',IntegerType(),True),
                        StructField('Age_lag',IntegerType(),True),
                        ])

# the schema defines the required output format
@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    # Input/output are both a pandas.DataFrame

    # return a totally different DataFrame
    dt = pd.DataFrame({'User': ['Alice', 'Bob'], 
                      'Sport': ['Football', 'Basketball'], 
                        'Age': [27, 34], 
                    'Age_lag': [27, None]})


    return dt

df.groupby('id').apply(my_custom_function).toPandas() 
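
The same pandas behaviour applies here: mixing None into an otherwise numeric column gives a float64 column with NaN, so Arrow again has to fit a float column into the declared IntegerType. A quick check:

import pandas as pd

dt = pd.DataFrame({'Age_lag': [27, None]})
print(dt['Age_lag'].dtype)  # float64 -- None becomes NaN in a numeric column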

Questions

  • How do you handle this?
  • Is this bad design?
    • (I can imagine a thousand cases in which I really do want to return NaN or None.)
  • Do we really have to fill in all missing values and then put them back afterwards? Or use floats instead of integers (see the sketch after this list)? Etc.?
  • Will this be fixed in the near future? (pandas_udf is still quite new.)
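
A sketch of the "floats instead of integers" idea from the list above (an assumption, not a verified answer): declare the lagged column as DoubleType so the NaN survives the Arrow conversion, then cast it back to an integer column on the Spark side, where the missing values come out as nulls.

from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

custom_schema = StructType([
                        StructField('User', StringType(), True),
                        StructField('Sport', StringType(), True),
                        StructField('Age', IntegerType(), True),
                        StructField('Age_lag', DoubleType(), True),  # float so that NaN is allowed
                        ])

@pandas_udf(custom_schema, PandasUDFType.GROUPED_MAP)
def my_custom_function(pdf):
    dt = pd.DataFrame({'User': ['Alice', 'Bob'], 'Sport': ['Football', 'Basketball'], 'Age': [27, 34]})
    dt['Age_lag'] = dt['Age'].shift(1)
    return dt

# cast back after the UDF; the NaN rows become null integers
result = df.groupby('id').apply(my_custom_function) \
           .withColumn('Age_lag', col('Age_lag').cast(IntegerType()))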

0 Answers:

There are no answers yet.