spark-shell --packages com.databricks:spark-csv_2.11:1.2.0

Using SQLContext
~~~~~~~~~~~~~~~~

import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)
import sqlctx._

val df = sqlctx.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .option("header", "true")
  .load("/user/cloudera/data.csv")

df.select(avg($"col1")).show() // this works fine

sqlctx.sql("select percentile_approx(balance, 0.5) as median from port_bank_table").show()
// or
sqlctx.sql("select percentile(balance, 0.5) as median from port_bank_table").show()
// neither works; both fail with the following error:

org.apache.spark.sql.AnalysisException: undefined function percentile_approx; line 0 pos 0
    at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
    at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)

Using HiveContext
~~~~~~~~~~~~~~~~~

So I tried using the Hive context:

scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> val hivectx = new HiveContext(sc)
18/01/09 22:51:06 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
hivectx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@5be91161

scala> import hivectx._
import hivectx._
but I am getting the below error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@be453c4,
see the next exception for details.
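An aside on this error (not from the original thread): "Failed to start database 'metastore_db'" usually means the embedded Derby metastore is already locked, either by another spark-shell still running or by a stale lock file left behind by a crashed session. A minimal sketch of one common fix, assuming the default embedded Derby metastore in the current working directory; exit any other Spark shells first:

```shell
# Assuming the default embedded Derby metastore lives in ./metastore_db.
# Stop any other spark-shell first, then remove stale Derby lock files:
rm -f metastore_db/*.lck
```

After removing the lock files, restarting spark-shell and creating the HiveContext again should succeed, provided no other process is still using the metastore.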
Answer 0 (score: 0)
I couldn't find any percentile_approx or percentile function among Spark's aggregate functions. It looks like this functionality is not built into Spark DataFrames. For details, see How to calculate Percentile of column in a DataFrame in spark? I hope it helps.
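As an illustrative workaround (not from the original answer): when no built-in percentile function is available, you can collect the column to the driver and compute an exact median in plain Scala. This is only feasible when the column's values fit in driver memory, e.g. after a `collect()` on a small DataFrame:

```scala
// Exact median over an in-memory sequence; suitable only for values
// that fit in driver memory (e.g. collected from a small DataFrame).
def median(values: Seq[Double]): Double = {
  require(values.nonEmpty, "median of an empty sequence is undefined")
  val sorted = values.sorted
  val n = sorted.length
  if (n % 2 == 1) sorted(n / 2)                       // odd count: middle element
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0      // even count: mean of the two middle elements
}

println(median(Seq(1.0, 3.0, 2.0)))      // prints 2.0
println(median(Seq(4.0, 1.0, 3.0, 2.0))) // prints 2.5
```

This sacrifices scalability for simplicity; it is a local computation, not a distributed aggregate like `percentile_approx`.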
Answer 1 (score: 0)
It should work; for that, you should save the DataFrame as a table using
saveAsTable. Then you will be able to run your query using HiveContext.

import org.apache.spark.sql.SaveMode

someDF.write.mode(SaveMode.Overwrite)
  .format("parquet")
  .saveAsTable("Table_name")
// In my case "mode" works as mode("Overwrite")

hivectx.sql("select avg(col1) as median from Table_name").show()

It will work.