I have a function that computes descriptive statistics and quantiles in PySpark. Here is my code so far.
import numpy as np
import pandas as pd
from decimal import Decimal

def basic_stats(df_in, columns, deciles=False):
    """
    Function to union the basic stats results and deciles
    :param df_in: the input dataframe
    :param columns: the column name list of the numerical variables
    :param deciles: if True, output deciles instead of quartiles
    :return: the numerical describe info of the input dataframe
    """
    if deciles:
        percentiles = np.array(range(0, 110, 10))
    else:
        percentiles = [25, 50, 75]
    # One row per percentile, one column per input column
    percs = np.transpose([np.percentile(df_in.select(x).collect(), percentiles) for x in columns])
    percs = pd.DataFrame(percs, columns=columns)
    percs['summary'] = [str(p) + '%' for p in percentiles]
    # Append the percentiles to the output of Spark's describe()
    spark_describe = df_in.describe().toPandas()
    new_df = pd.concat([spark_describe, percs], ignore_index=True)
    new_df = Decimal(new_df.round(2))
    return new_df[['summary'] + columns]

num_cols = [item[0] for item in df.dtypes if item[1].startswith(("decimal", "bigint", "float"))]
basic_stats(df, num_cols, deciles=True)
My problem is that I get the following error:
unsupported operand type(s) for *: 'float' and 'decimal.Decimal'
I don't understand why this doesn't work!
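
For reference, the same message can be reproduced without Spark. This is only a minimal sketch: it assumes that values collected from a decimal column arrive as decimal.Decimal objects, and that percentile interpolation multiplies them by a float weight. The names `values` and `scale` are hypothetical and used only for illustration.

```python
from decimal import Decimal

# Assumption: this is roughly what a collected Spark decimal column looks like.
values = [Decimal("1.10"), Decimal("2.20"), Decimal("3.30")]


def scale(weight, value):
    """Multiply a float weight by a value, as an interpolation step might."""
    return weight * value


# Mixing float and Decimal in arithmetic raises TypeError.
try:
    scale(0.5, values[0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Converting to float first avoids the mixed-type arithmetic.
as_floats = [float(v) for v in values]
print(scale(0.5, as_floats[0]))
```

If that assumption holds, converting the collected values to float, or casting the decimal columns to double before collecting, might avoid the error. I have not confirmed this against the full function.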