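I am trying to select a few columns from a DataFrame (in Scala). The problem is that I cannot put all of the column expressions into a single string and pass that string to the DataFrame's select function. I tried the following, but it does not work: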
scala> val str1 = "sum(\"bal1\")/100,"
str1: String = sum("bal1")/100,
scala> val str2 = "sum(\"bal2\")/100"
str2: String = sum("bal2")/100
scala> val str3 = str1.concat(str2)
str3: String = sum("bal1")/100,sum("bal2")/100
peopleDataFrame.select(str3).show // Throws AnalysisException as mentioned below
scala> peopleDataFrame.select(str3).show
org.apache.spark.sql.AnalysisException: cannot resolve 'sum("bal1")/100,sum("bal2")/100' given input columns name, bal1, bal2;
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sc.textFile("hdfs://quickstart.cloudera:8020/user/sekar/1.txt")
val schemaString = "name,bal1,bal2"
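import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
// name is a string column; bal1 and bal2 are integers, matching the Row built below
val schema =
  StructType(
    schemaString.split(",").map(fieldName =>
      StructField(fieldName, if (fieldName == "name") StringType else IntegerType, true)))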
val rowRDD = people.map(_.split(",")).map(p => Row(p(0).toString, p(1).toInt, p(2).toInt))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
str3 evaluates correctly to 'sum("bal1")/100,sum("bal2")/100'. Please tell me how to resolve the AnalysisException.
If you need more information, please let me know. Thanks in advance.
Answer 0 (score: 0)
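The Spark API does not support passing multiple expressions at once in a single string. select(String) treats each string argument as the name of one column, which is why the whole string fails to resolve.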
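Some parts of the input should also be changed: for instance, the examples below refer to the columns as bal1 and bal2, without double quotes.
That said, this can be done in at least two different ways: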
1) Replace the select method with selectExpr and pass each projection as a separate argument. For example:
peopleDataFrame.selectExpr("sum(bal1) / 100", "sum(bal2) / 100").show
For details, see the selectExpr method in the DataFrame API: https://spark.apache.org/docs/1.6.1/api/scala/#org.apache.spark.sql.DataFrame
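If the projections really do arrive as one comma-separated string, one option is to split that string first and hand the pieces to selectExpr as varargs. The helper below is only a minimal sketch and not part of the Spark API: ProjectionSplitter and splitProjections are hypothetical names, and the code assumes that no expression contains a comma or parenthesis inside a string literal.

```scala
object ProjectionSplitter {
  /** Splits a projection string on top-level commas only, so commas that
    * separate function arguments inside parentheses are left untouched.
    * Assumption: no commas or parentheses appear inside string literals. */
  def splitProjections(projections: String): Seq[String] = {
    val parts = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder
    var depth = 0 // current parenthesis nesting level
    for (ch <- projections) ch match {
      case '(' =>
        depth += 1
        current.append(ch)
      case ')' =>
        depth -= 1
        current.append(ch)
      case ',' if depth == 0 =>
        // a top-level comma ends the current expression
        parts += current.toString.trim
        current.clear()
      case _ =>
        current.append(ch)
    }
    parts += current.toString.trim
    // drop empty pieces, e.g. the one produced by a trailing comma
    parts.filter(_.nonEmpty).toList
  }
}
```

With the DataFrame from the question and a live Spark session, the call would then look like peopleDataFrame.selectExpr(ProjectionSplitter.splitProjections("sum(bal1)/100,sum(bal2)/100"): _*).show, where ": _*" expands the sequence into the String* parameter of selectExpr.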
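2) Register the DataFrame as a temporary table and execute the SQL directly (this can be useful if the projections come from an external source):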
peopleDataFrame.registerTempTable("peopleDataFrame")
sqlContext.sql("SELECT sum(bal1) / 100, sum(bal2) / 100 FROM peopleDataFrame").show()
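When the projections come from an external source, the SQL text for the second approach can be assembled from a list of expressions. The following is a hedged sketch: QueryBuilder and selectFrom are hypothetical names introduced for illustration, and the code does no validation or escaping, so it should only be given trusted input.

```scala
object QueryBuilder {
  /** Builds "SELECT <projections> FROM <table>" from individual expressions.
    * Assumption: table and projections are trusted strings; nothing is
    * escaped or validated here. */
  def selectFrom(table: String, projections: Seq[String]): String = {
    require(projections.nonEmpty, "at least one projection is required")
    s"SELECT ${projections.mkString(", ")} FROM $table"
  }
}
```

The resulting string can be passed to sqlContext.sql after the temporary table has been registered, for example sqlContext.sql(QueryBuilder.selectFrom("peopleDataFrame", Seq("sum(bal1) / 100", "sum(bal2) / 100"))).show().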