I have a file named tags (UserId, MovieId, Tag) that serves as the input to my algorithm, and I register it as a table with registerTempTable.
val orderedId = sqlContext.sql("SELECT MovieId AS Id,Tag FROM tag ORDER BY MovieId")
This query gives me a result made up of Id and Tag, which is the input to the second step:
val eachTagCount =orderedId.groupBy(" Id,Tag").count()
But this raises an error. Here is the full code:
case class DataClass( MovieId:Int,UserId: Int, Tag: String)
// Create an RDD of DataClass objects and register it as a table.
val Data = sc.textFile("file:///usr/local/spark/dataset/tagupdate").map(_.split(",")).map(p => DataClass(p(0).trim.toInt, p(1).trim.toInt, p(2).trim)).toDF()
Data.registerTempTable("tag")
val orderedId = sqlContext.sql("SELECT MovieId AS Id,Tag FROM tag ORDER BY MovieId")
orderedId.rdd
.map(_.toSeq.map(_+"").reduce(_+","+_))
.saveAsTextFile("/usr/local/spark/dataset/algorithm3/output")
val eachTagCount =orderedId.groupBy(" Id,Tag").count()
eachTagCount.rdd
.map(_.toSeq.map(_+"").reduce(_+","+_))
.saveAsTextFile("/usr/local/spark/dataset/algorithm3/output2")
Exception:
Caused by: org.apache.spark.sql.AnalysisException: Cannot resolve column name " Id,Tag" among (Id, Tag);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:121)
How can I resolve this error?
Answer 0 (score: 1)
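Try val eachTagCount = orderedId.groupBy("Id", "Tag").count(). You are passing a single string for multiple columns, so Spark looks for one column literally named " Id,Tag" and cannot find it among (Id, Tag). Each column name has to be its own argument.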
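To make the fix concrete, here is a minimal sketch that can be pasted into a Spark 1.x spark-shell, where `sc` and `sqlContext` are assumed to be predefined. The sample rows are made up for illustration and stand in for the (Id, Tag) result of the query in the question. The `counts` map is only a helper for inspecting the result locally.

```scala
// Assumes a Spark 1.x spark-shell session: `sc` and `sqlContext` already exist.
import sqlContext.implicits._

// Hypothetical sample rows standing in for the (Id, Tag) output of the SQL query.
val orderedId = sc.parallelize(Seq(
  (1, "funny"), (1, "funny"), (1, "classic"), (2, "funny")
)).toDF("Id", "Tag")

// Pass each grouping column as a separate string argument,
// not as one comma-separated string.
val eachTagCount = orderedId.groupBy("Id", "Tag").count()

// Collect into a local Map keyed by (Id, Tag) so the counts are easy to check.
// The "count" column produced by count() is a Long.
val counts = eachTagCount.collect()
  .map(r => ((r.getInt(0), r.getString(1)), r.getLong(2)))
  .toMap
```

With these sample rows, `counts` should hold three entries, one per distinct (Id, Tag) pair. The same grouping can also be written with column expressions, as in groupBy($"Id", $"Tag"), once the implicits are imported.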