I have a Spark dataframe with the following schema:
```
StructType(List(
    StructField(col1,IntegerType,true),
    StructField(col2,IntegerType,true),
    StructField(col3,StringType,true),
    StructField(col4,StringType,true),
    StructField(col5,StringType,true),
    StructField(col6,StringType,true),
    StructField(col7,StringType,true),
    StructField(col8,StringType,true),
    StructField(col9,StringType,true),
    StructField(col10,DecimalType(15,3),true),
    StructField(col11,IntegerType,true),
    StructField(col12,DecimalType(18,0),true),
    StructField(group_col,IntegerType,true),
    StructField(col13,IntegerType,true),
    StructField(col14,IntegerType,true),
    StructField(col15,StringType,true),
    StructField(col16,StringType,true),
    StructField(col17,StringType,true),
    StructField(col18,StringType,true),
    StructField(col19,DecimalType(5,4),true),
    StructField(col20,DecimalType(5,4),true),
    StructField(col21,DecimalType(5,4),true),
    StructField(col22,IntegerType,true),
    StructField(col23,IntegerType,true)))
```
I am using the following pandas UDF:
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

schema = StructType([StructField("grouped_key", IntegerType(), True),
                     StructField("auc", DoubleType(), True),
                     StructField("mean_accu", DoubleType(), True),
                     StructField("prec_5", DoubleType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
# Input/output are both a pandas.DataFrame
def get_performance(df_ag):
    group_key = df_ag['grouped_key'].iloc[0]
    # modeling code based on the grouped key
    return pd.DataFrame.from_dict({'grouped_key': [group_key],
                                   'auc': [5.0],
                                   'mean_accu': [5.0],
                                   'prec_5': [5.0]})
```
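To rule out the function body itself, the same logic can be exercised on a plain pandas DataFrame without Spark (a minimal check; the sample data here is made up for illustration):

```python
import pandas as pd

def get_performance_body(df_ag):
    # Same logic as the UDF body: take the group key from the first row
    group_key = df_ag['grouped_key'].iloc[0]
    return pd.DataFrame.from_dict({'grouped_key': [group_key],
                                   'auc': [5.0],
                                   'mean_accu': [5.0],
                                   'prec_5': [5.0]})

# Hypothetical sample group, as a stand-in for one groupby partition
sample = pd.DataFrame({'grouped_key': [7, 7], 'col1': [1, 2]})
out = get_performance_body(sample)
print(out['grouped_key'].iloc[0])  # 7
```

The body runs fine on its own, so the failure is in serializing the UDF, not in executing it.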
I always get the following error:
```
df = df.groupby('attr_grp_cd').apply(get_performance)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/group.py", line 274, in apply
    udf_column = udf(*[df[col] for col in df.columns])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 189, in wrapper
    return self(*args)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 167, in __call__
    judf = self._judf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 151, in _judf
    self._judf_placeholder = self._create_judf()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 160, in _create_judf
    wrapped_func = _wrap_function(sc, self.func, self.returnType)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 35, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 600, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o186.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
```
As you can see from the current code, I have stripped everything out and am simply returning hard-coded row values for the given grouping key:

```python
df = df.groupby('grouped_key').apply(get_performance)
```
My assumption is that it cannot serialize some object, but I cannot tell from the stack trace what that object might be, especially since I am not doing anything inside the UDF. Any help is much appreciated.
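My understanding (an assumption, not something the trace confirms) is that Spark pickles the UDF together with everything its closure references, and a py4j-backed JVM handle such as a SparkSession, DataFrame, or broadcast object cannot be pickled, which would produce a `__getstate__` error like the one above. A stdlib-only sketch of the same failure mode, with a thread lock standing in for the unpicklable JVM handle:

```python
import pickle
import threading

class Holder:
    """Stand-in for an object wrapping a JVM-backed (py4j) handle."""
    def __init__(self):
        self.handle = threading.Lock()  # locks are unpicklable, like py4j objects

holder = Holder()

def udf_like(x):
    # The closure captures `holder`, so serializing this function
    # would also have to serialize the unpicklable handle inside it.
    return holder, x

try:
    pickle.dumps(holder)
    err = None
except TypeError as exc:
    err = type(exc).__name__

print(err)  # TypeError
```

But I do not see where my UDF could be capturing such an object, given that its body is nothing but hard-coded values.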
Also, I am running on AWS EMR. Is there a way to inspect the logs inside the executors? Any logging I put inside the UDF itself never shows up, and I would like to know where to find those logs.
Using Spark 2.4.4.