Question

spark-shell --packages com.databricks:spark-csv_2.11:1.2.0

1. Using SQLContext
~~~~~~~~~~~~~~~~~~~~
import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)
import sqlctx._

val df = sqlctx.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("delimiter", ";").option("header", "true").load("/user/cloudera/data.csv")
df.select(avg($"col1")).show() // this works fine
sqlctx.sql("select percentile_approx(balance, 0.5) as median from port_bank_table").show()
or
sqlctx.sql("select percentile(balance, 0.5) as median from port_bank_table").show()
// neither works; both give the error below

org.apache.spark.sql.AnalysisException: undefined function percentile_approx; line 0 pos 0
    at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
    at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
Using HiveContext
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
So I tried using the Hive context:

scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> val hivectx = new HiveContext(sc)
18/01/09 22:51:06 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
hivectx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@5be91161

scala> import hivectx._
import hivectx._

Getting the error below
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@be453c4, see the next exception for details.

Answer 1

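I could not find any percentile_approx or percentile function among the Spark aggregate functions. It looks like this functionality is not built into Spark DataFrames. For more details, see "How to calculate Percentile of column in a DataFrame in spark?". I hope it helps.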
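
If no percentile aggregate is available on the DataFrame side, one possible workaround for data small enough to fit on the driver is to collect the column and compute the median in plain Scala. The following is only a minimal sketch: `median` is a hypothetical helper written for illustration, not a Spark function.

```scala
// Hypothetical helper (not part of Spark): exact median of an in-memory sequence.
// Returns None when the input is empty.
def median(values: Seq[Double]): Option[Double] = {
  if (values.isEmpty) None
  else {
    val sorted = values.sorted
    val mid = sorted.length / 2
    // Odd count: take the middle element; even count: average the two middle elements.
    if (sorted.length % 2 == 1) Some(sorted(mid))
    else Some((sorted(mid - 1) + sorted(mid)) / 2.0)
  }
}
```

In spark-shell the input might come from something like `df.select(df("balance").cast("double")).rdd.map(_.getDouble(0)).collect()`, assuming a numeric `balance` column with no nulls and a data set small enough to collect.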

Answer 2

I don't think so; it should work. To do that, save the DataFrame as a table
using saveAsTable. Then you will be able to run your query using
HiveContext.

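import org.apache.spark.sql.SaveMode

// df must be a DataFrame created through the HiveContext
df.write.mode(SaveMode.Overwrite)
        .format("parquet")
        .saveAsTable("Table_name")

// In my case, mode also works when written as mode("Overwrite")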

hivectx.sql("select avg(col1) as median from Table_name").show()

It will work.
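
As a sketch of how the median itself (rather than `avg`) might be queried through HiveContext, the snippet below is meant for spark-shell. It assumes the spark-csv package from the question is on the classpath, that the CSV has a numeric `balance` column, and that the HiveContext starts without the metastore_db error. The path and table name are taken from the question. `percentile_approx` is a Hive UDAF, which is why the query goes through `hivectx` rather than a plain SQLContext.

```scala
// Run inside spark-shell, where sc is provided by the shell.
import org.apache.spark.sql.hive.HiveContext

val hivectx = new HiveContext(sc)

// Read the CSV through the HiveContext so the temp table is visible to hivectx.sql
val df = hivectx.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .option("header", "true")
  .load("/user/cloudera/data.csv")

// Register under the table name used in the question's queries
df.registerTempTable("port_bank_table")

// Approximate median of the balance column (assumed numeric)
hivectx.sql("select percentile_approx(balance, 0.5) as median from port_bank_table").show()
```

Registering a temp table avoids writing to the metastore; `saveAsTable` would be the choice when the table should persist across sessions.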

Calculating median and mean with Hadoop Spark 1.6 DataFrames: Failed to start database 'metastore_db'
