UPDATE Rebuilt 2.2.0-SNAPSHOT using the latest changes from master and without my local changes to def schema in Dataset. It works. Sorry for the noise :(

$ ./bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
Branch master
Compiled by user jacek on 2017-03-27T19:00:06Z
Revision 3fada2f502107bd5572fb895471943de7b2c38e4
Url https://github.com/apache/spark.git
Type --help for more information.
scala> spark.range(1).printSchema
root
|-- id: long (nullable = false)
scala> spark.range(1).selectExpr("*").printSchema
root
|-- id: long (nullable = false)
While playing with selectExpr (in today's 2.2.0-SNAPSHOT built off master) I noticed that the schema changed to include an id column. I can't seem to explain it. Anyone?

I can reproduce it every time I start spark-shell by doing the following:

scala> spark.version
res0: String = 2.2.0-SNAPSHOT
scala> spark.range(1).printSchema
root
|-- value: long (nullable = true)
scala> spark.range(1).explain(true)
== Parsed Logical Plan ==
Range (0, 1, step=1, splits=Some(8))
== Analyzed Logical Plan ==
id: bigint
Range (0, 1, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 1, step=1, splits=Some(8))
== Physical Plan ==
*Range (0, 1, step=1, splits=Some(8))
scala> spark.range(1).printSchema
root
|-- value: long (nullable = true)
scala> spark.range(1).selectExpr("*").printSchema
root
|-- id: long (nullable = false)
scala> val rangeDS = spark.range(1)
rangeDS: org.apache.spark.sql.Dataset[Long] = [value: bigint]
scala> rangeDS.selectExpr("*").printSchema
root
|-- id: long (nullable = false)
For reference, the exact spark-shell build I used:
$ ./bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
Branch master
Compiled by user jacek on 2017-03-27T03:43:09Z
Revision 3fbf0a5f9297f438bc92db11f106d4a0ae568613
Url https://github.com/apache/spark.git
Type --help for more information.
P.S. It seems I can't reproduce it in 2.1.0.
Answer 0 (score: 0)
I would say the answer lies in the source code: for each "expression" you pass into selectExpr, the function creates a new Column and then calls select with them:
def selectExpr(exprs: String*): DataFrame = {
  select(exprs.map { expr =>
    Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
  }: _*)
}
And if you look at what the select above does:
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see that it takes the new Columns obtained from the SQL expressions and creates a new DataFrame containing them, resolved against those of the original DataFrame.
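The delegation described above (each expression string is parsed into a Column, and the result is handed to select) can be sketched as a toy model without a Spark dependency. ToyFrame and the trivial parseExpression below are illustrative stand-ins, not Spark's actual classes:

```scala
// Toy sketch of the selectExpr -> select delegation quoted above.
// ToyFrame and parseExpression are illustrative stand-ins, NOT Spark's API.
case class Column(expr: String)

case class ToyFrame(columns: Seq[String]) {
  // stand-in for Dataset.select(cols: Column*): takes already-built Columns
  def select(cols: Column*): ToyFrame =
    ToyFrame(cols.map(_.expr).flatMap {
      case "*" => columns          // "*" expands to every column of this frame
      case c   => Seq(c)
    })

  // mirrors the quoted source: parse each string into a Column, delegate to select
  def selectExpr(exprs: String*): ToyFrame =
    select(exprs.map(e => Column(parseExpression(e))): _*)

  // trivial stand-in for sparkSession.sessionState.sqlParser.parseExpression
  private def parseExpression(e: String): String = e.trim
}

println(ToyFrame(Seq("id")).selectExpr("*").columns)  // List(id)
```

The point of the model is only the shape of the call chain: selectExpr never resolves names itself, it builds Columns and lets select (and, in real Spark, the analyzer) resolve them against the underlying plan, which is why the star expands to the plan's attribute name.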
EDIT I tried it with 2.2.0 and got:
res7: String = 2.2.0
root
|-- id: long (nullable = false)
root
|-- id: long (nullable = false)