My book records have two columns, book and reader, where book and reader are the book ID and reader ID respectively. When I try to order readers by the number of books they have read, I get an AbstractSparkSQLParser exception:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.sql.functions._

object Small {

  case class Book(book: Int, reader: Int)

  val recs = Array(
    Book(book = 1, reader = 30),
    Book(book = 2, reader = 10),
    Book(book = 3, reader = 20),
    Book(book = 1, reader = 20),
    Book(book = 1, reader = 10),
    Book(book = 1, reader = 40),
    Book(book = 2, reader = 40),
    Book(book = 2, reader = 30))

  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // set up environment
    val conf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("Small")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(recs).toDF()
    val readerGroups = df.groupBy("reader").count()
    readerGroups.show()
    readerGroups.registerTempTable("readerGroups")
    readerGroups.printSchema()

    // "SELECT reader, count FROM readerGroups ORDER BY count DESC"
    val readerGroupsSorted = sqlContext.sql("SELECT * FROM readerGroups ORDER BY count DESC")
    readerGroupsSorted.show()
    println("Group Cnt: " + readerGroupsSorted.count())
  }
}
Here is the output; the `groupBy` works fine:
reader count
40     2
10     2
20     2
30     2
The resulting schema:
root
|-- reader: integer (nullable = false)
|-- count: long (nullable = false)
However, SELECT * FROM readerGroups ORDER BY count DESC fails with an exception (see below). In fact, all other SELECT requests fail as well, except SELECT * FROM readerGroups and SELECT reader FROM readerGroups; those two work. Why is that? How do I make ORDER BY count DESC work?
Exception in thread "main" java.lang.RuntimeException: [1.43] failure: ``('' expected but `desc' found
SELECT * FROM readerGroups ORDER BY count DESC
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:134)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:138)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:933)
at Small$.main(Small.scala:60)
at Small.main(Small.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Answer 0 (score: 3)
The problem is the name of the column: count. COUNT is a reserved word in Spark SQL, so you cannot refer to the column by name in a query, nor sort by it, without escaping it.
You can try escaping it with backticks:
select * from readerGroups ORDER BY `count` DESC
Another option is to rename the count column to something different, such as numReaders.
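For example, here is a minimal, untested sketch of both workarounds against the df and sqlContext from the question (readerGroups2 and numReaders are just illustrative names):

// Workaround 1: escape the reserved word with backticks inside the SQL string
val sortedEscaped = sqlContext.sql(
  "SELECT * FROM readerGroups ORDER BY `count` DESC")

// Workaround 2: rename the generated count column first, so no escaping is needed
val renamed = df.groupBy("reader").count()
  .withColumnRenamed("count", "numReaders")
renamed.registerTempTable("readerGroups2")
val sortedRenamed = sqlContext.sql(
  "SELECT * FROM readerGroups2 ORDER BY numReaders DESC")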
Answer 1 (score: 0)
Use a derived table to order by a computed field (e.g. top, max, count, ...):
SELECT * FROM
(
  SELECT reader, count(book) AS book_count
  FROM readerbook
  GROUP BY reader
) a
ORDER BY book_count DESC
Actually, on second thought, your ORDER BY will probably just work if you use an alias like this:
SELECT reader, count(book) AS book_count
FROM readerbook
GROUP BY reader
ORDER BY book_count DESC
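The same sort can also be expressed without SQL at all. Here is a short sketch using the DataFrame API and the functions._ import the question already has; it sidesteps the SQL parser and its reserved words entirely (book_count is just an illustrative alias):

// Aggregate with an explicit alias, then sort descending on that column
val byBookCount = df.groupBy("reader")
  .agg(count("book").as("book_count"))
  .orderBy(desc("book_count"))
byBookCount.show()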