How to filter a Spark DataFrame

Date: 2015-09-15 23:07:21

Tags: scala apache-spark dataframe apache-spark-sql

I have a Spark DataFrame in which one field is a MapType. I can retrieve the data for any key of the MapType field, but applying a filter on the value of a specific key fails.

val line = List(("Sanjay", Map("one" -> 1, "two" -> 2)), ("Taru", Map("one" -> 10, "two" -> 20)))

I created an RDD and a DataFrame from the list above, and I am trying to get the map values where the value is >= 5, but I get the exception below in the Spark REPL. Please help.

val rowrddDFFinal = rowrddDF.select(rowrddDF("data.one").alias("data")).filter(rowrddDF("data.one").geq(5))

org.apache.spark.sql.AnalysisException: resolved attribute(s) data#1 missing from data#3 in operator !Filter (data#1[one] AS one#4 >= 5);
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:121)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
        at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)

1 Answer:

Answer 0 (score: 8):

To access values in a MapType column, you can use the Column.getItem method:

rowrddDF
  .where($"data".getItem("one").geq(5))
  .select($"data".getItem("one").alias("data"))

If you want to filter after the select, you can no longer use rowrddDF.apply (that is, rowrddDF("data.one")): the select keeps only the aliased column, so a reference back to the original map attribute can no longer be resolved, which is exactly what the AnalysisException above is complaining about. Instead, reference the aliased column directly:

rowrddDF
  .select($"data".getItem("one").alias("data"))
  .filter($"data".geq(5))