I have the following data (only an excerpt is shown):
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
I read the data with the inferSchema option set to true and then call describe on the columns. The output looks fine.
scala> val data = spark.read.option("header", "true").option("inferSchema","true").csv("./data/flight-data/csv/2015-summary.csv")
scala> data.describe().show()
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-------+-----------------+-------------------+------------------+
| count| 256| 256| 256|
| mean| null| null| 1770.765625|
| stddev| null| null|23126.516918551915|
| min| Algeria| Angola| 1|
| max| Zambia| Vietnam| 370002|
+-------+-----------------+-------------------+------------------+
If I do not specify inferSchema, all columns are treated as strings.
scala> val dataNoSchema = spark.read.option("header", "true").csv("./data/flight-data/csv/2015-summary.csv")
dataNoSchema: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
scala> dataNoSchema.printSchema
root
|-- DEST_COUNTRY_NAME: string (nullable = true)
|-- ORIGIN_COUNTRY_NAME: string (nullable = true)
|-- count: string (nullable = true)
Question 1) Why does Spark still give mean and stddev values for the last column, count?

scala> dataNoSchema.describe().show();
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-------+-----------------+-------------------+------------------+
| count| 256| 256| 256|
| mean| null| null| 1770.765625|
| stddev| null| null|23126.516918551915|
| min| Algeria| Angola| 1|
| max| Zambia| Vietnam| 986|
+-------+-----------------+-------------------+------------------+
Question 2) If Spark is now interpreting the count column as numeric, why is the max value 986 and not 370002 (as in the data DataFrame)?
Answer:
Spark SQL strives to conform to the SQL standard, so it uses the same evaluation rules and, when required, transparently coerces types to satisfy expressions (see, for example, my answer to PySpark DataFrames - filtering using comparisons between columns of different types).
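As a hedged toy illustration of that coercion (not taken from the question's data): comparing a string column against a numeric literal makes Spark insert a cast into the predicate instead of failing, so the comparison is no longer performed on raw strings.

// Toy sketch; assumes a spark-shell session, so spark.implicits._ is in scope.
// Filtering a string column with a numeric literal: Spark coerces the string
// side to a numeric type, following its SQL-style evaluation rules.
Seq("1", "10", "2").toDF("x").filter($"x" > 5).explain()
// The physical plan shows the string column wrapped in a cast before the comparison.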
This means that the max case and the mean / stddev case are simply not equivalent:
max is meaningful for strings (using lexicographic ordering), so no coercion is required:
Seq.empty[String].toDF("count").agg(max("count")).explain
== Physical Plan ==
SortAggregate(key=[], functions=[max(count#69)])
+- Exchange SinglePartition
+- SortAggregate(key=[], functions=[partial_max(count#69)])
+- LocalTableScan <empty>, [count#69]
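To make the consequence concrete, here is a small sketch with toy values echoing the question: under lexicographic ordering "986" is greater than "370002", because '9' > '3' at the first character. This is exactly why dataNoSchema reports a max of 986.

import org.apache.spark.sql.functions.max

// Lexicographic max over strings picks "986", not "370002":
// comparison is character by character, not numeric.
Seq("15", "370002", "986").toDF("count").agg(max("count")).show()
// +----------+
// |max(count)|
// +----------+
// |       986|
// +----------+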
mean and stddev are not meaningful for strings, so the argument is coerced (cast) to double:
Seq.empty[String].toDF("count").agg(mean("count")).explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[avg(cast(count#81 as double))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(count#81 as double))])
   +- LocalTableScan <empty>, [count#81]
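As a practical follow-up (my suggestion, not part of the original answer): if you prefer not to rely on inferSchema, cast the column explicitly so aggregates use numeric semantics. A sketch assuming the dataNoSchema DataFrame defined in the question:

import org.apache.spark.sql.functions.max

// Assumes a spark-shell session (spark.implicits._ in scope).
// Cast the string column to long before aggregating.
val counts = dataNoSchema.withColumn("count", $"count".cast("long"))
counts.agg(max("count")).show()  // numeric max: 370002, matching the data DataFrame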