Calculating median and mean with a Hadoop Spark 1.6 DataFrame: Failed to start database 'metastore_db'

Time: 2018-01-10 06:57:54

Tags: spark-dataframe hadoop2 median hivecontext apache-spark-1.6

    spark-shell --packages com.databricks:spark-csv_2.11:1.2.0

1. Using SQLContext
~~~~~~~~~~~~~~~~~~~~

    import org.apache.spark.sql.SQLContext
    val sqlctx = new SQLContext(sc)
    import sqlctx._

    val df = sqlctx.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("delimiter", ";").option("header", "true").load("/user/cloudera/data.csv")

    df.select(avg($"col1")).show()   // this works fine

    sqlctx.sql("select percentile_approx(balance, 0.5) as median from port_bank_table").show()

or

    sqlctx.sql("select percentile(balance, 0.5) as median from port_bank_table").show()

Neither works; both give the following error:

    org.apache.spark.sql.AnalysisException: undefined function percentile_approx; line 0 pos 0
        at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
        at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)

2. Using HiveContext
~~~~~~~~~~~~~~~~~~~~

So I tried using the Hive context:

    scala> import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.HiveContext

    scala> val hivectx = new HiveContext(sc)
    18/01/09 22:51:06 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
    hivectx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@5be91161

    scala> import hivectx._
    import hivectx._

Getting the error below:
~~~~~~~~~~~~~~~~~~~~~~~~

    Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@be453c4,
    see the next exception for details.
    

2 answers:

Answer 0: (score: 0)

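I could not find any percentile_approx or percentile function among the Spark aggregate functions. It looks like this functionality is not built into Spark DataFrames. For more details, see "How to calculate Percentile of column in a DataFrame in spark?" I hope this helps.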
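If no percentile function is available, one fallback is to pull the column to the driver and compute the exact median in plain Scala. The following is only a minimal sketch: `median` is a hypothetical helper name, and it assumes the column is small enough to fit in driver memory and contains no nulls.

```scala
// Exact median of a local collection; returns None for empty input.
def median(values: Seq[Double]): Option[Double] = {
  if (values.isEmpty) None
  else {
    val sorted = values.sorted
    val n = sorted.length
    // Odd length: the middle element. Even length: mean of the two middle elements.
    if (n % 2 == 1) Some(sorted(n / 2))
    else Some((sorted(n / 2 - 1) + sorted(n / 2)) / 2.0)
  }
}

// Possible use in spark-shell (assumption: df has a numeric, non-null "balance" column):
//   val balances = df.select(df("balance").cast("double")).rdd.map(_.getDouble(0)).collect()
//   median(balances)
```

Collecting to the driver does not scale to large tables; for those, an approximate percentile computed inside the cluster is the more usual choice.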

Answer 1: (score: 0)

I don't think so; it should work. To make it work, save the DataFrame as a table
using saveAsTable. Then you will be able to run your query using
HiveContext.

import org.apache.spark.sql.SaveMode

// df is the DataFrame you want to persist as a table
df.write.mode(SaveMode.Overwrite)
              .format("parquet")
              .saveAsTable("Table_name")

// In my case, "mode" also works when given as a string: mode("Overwrite")

hivectx.sql("select avg(col1) as mean from Table_name").show()

It will work.
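For the median itself, the query can be extended along the same lines. This is a rough sketch for spark-shell on Spark 1.6, not a verified fix: it assumes the HiveContext starts cleanly (that is, the 'metastore_db' error from the question is resolved), that Table_name was saved as described above, and that col1 is numeric. percentile_approx here is the Hive aggregate function, which is reachable through HiveContext but not through a plain SQLContext.

```scala
// Sketch for spark-shell; `sc` is the SparkContext that the shell provides.
import org.apache.spark.sql.hive.HiveContext

val hivectx = new HiveContext(sc)

// Mean and approximate median in one pass over the saved table.
// The cast is a precaution in case col1 was inferred as an integer type.
hivectx.sql(
  "select avg(col1) as mean, " +
  "percentile_approx(cast(col1 as double), 0.5) as median " +
  "from Table_name"
).show()
```

The result is an approximation, which is usually acceptable for large tables.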