Problem using the spark_df_profiling_optimus library

Date: 2019-05-20 14:01:38

Tags: python hadoop

I ran into the following problem while running a profile with spark_df_profiling_optimus:

import spark_df_profiling_optimus
report = spark_df_profiling_optimus.ProfileReport(spark_df)

I get the following error:

report=spark_df_profiling_optimus.ProfileReport(spark_df)
  File "/home/dmp_admin/anaconda2/lib/python2.7/site-packages/spark_df_profiling_optimus-0.1.1-py2.7.egg/spark_df_profiling_optimus/__init__.py", line 19, in __init__
    description_set = describe(df, bins=bins, corr_reject=corr_reject, **kwargs)
  File "/home/dmp_admin/anaconda2/lib/python2.7/site-packages/spark_df_profiling_optimus-0.1.1-py2.7.egg/spark_df_profiling_optimus/base.py", line 440, in describe
    ldesc = {column: describe_1d(df, column, table_stats["n"]) for column in df.columns}
  File "/home/dmp_admin/anaconda2/lib/python2.7/site-packages/spark_df_profiling_optimus-0.1.1-py2.7.egg/spark_df_profiling_optimus/base.py", line 440, in <dictcomp>
    ldesc = {column: describe_1d(df, column, table_stats["n"]) for column in df.columns}
  File "/home/dmp_admin/anaconda2/lib/python2.7/site-packages/spark_df_profiling_optimus-0.1.1-py2.7.egg/spark_df_profiling_optimus/base.py", line 406, in describe_1d
    result = result.append(describe_integer_1d(df, column, result, nrows))
  File "/home/dmp_admin/anaconda2/lib/python2.7/site-packages/spark_df_profiling_optimus-0.1.1-py2.7.egg/spark_df_profiling_optimus/base.py", line 209, in describe_integer_1d
    .format(col=column, n=x)).toPandas().ix[:,0]
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 876, in selectExpr
    jdf = self._jdf.selectExpr(self._jseq(expr))
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'undefined function percentile;'

0 answers:

No answers yet.