How does Spark compute the mean and standard deviation of a string column?

Date: 2019-02-03 15:42:41

Tags: apache-spark

I have the following data (excerpt only):

DEST_COUNTRY_NAME   ORIGIN_COUNTRY_NAME count
United States   Romania 15
United States   Croatia 1
United States   Ireland 344
Egypt   United States   15

With the inferSchema option set to true, I read the file and call describe() on the resulting DataFrame. The output looks correct:

scala> val data = spark.read.option("header", "true").option("inferSchema","true").csv("./data/flight-data/csv/2015-summary.csv")
scala> data.describe().show()
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|             count|
+-------+-----------------+-------------------+------------------+
|  count|              256|                256|               256|
|   mean|             null|               null|       1770.765625|
| stddev|             null|               null|23126.516918551915|
|    min|          Algeria|             Angola|                 1|
|    max|           Zambia|            Vietnam|            370002|
+-------+-----------------+-------------------+------------------+

If I do not specify inferSchema, all columns are treated as strings:

scala> val dataNoSchema = spark.read.option("header", "true").csv("./data/flight-data/csv/2015-summary.csv")
dataNoSchema: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> dataNoSchema.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: string (nullable = true)

Question 1) Why does Spark produce mean and stddev values for the last column, count, even though it is a string column?

scala> dataNoSchema.describe().show();
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|             count|
+-------+-----------------+-------------------+------------------+
|  count|              256|                256|               256|
|   mean|             null|               null|       1770.765625|
| stddev|             null|               null|23126.516918551915|
|    min|          Algeria|             Angola|                 1|
|    max|           Zambia|            Vietnam|               986|
+-------+-----------------+-------------------+------------------+

Question 2) If Spark now interprets the count column as numeric, then why is the max value 986 and not 370002 (as it is for the data DataFrame)?

1 Answer:

Answer 0 (score: 0)

Spark SQL aims to conform to the SQL standard, so it uses the same evaluation rules and, where required, transparently coerces types so that expressions type-check (see for example my answer to PySpark DataFrames - filtering using comparisons between columns of different types).
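As a minimal sketch of that coercion (assuming a running spark-shell session; the exact plan strings vary by Spark version), an arithmetic expression over a string value does not fail but gets a cast inserted:

    // '10' is a string literal, yet the expression type-checks:
    // Spark promotes the string operand to double, so the result is 11.0.
    spark.sql("SELECT '10' + 1").show()

    // The same promotion applies to a string column used in arithmetic:
    import spark.implicits._
    val strings = Seq("1", "2", "10").toDF("count")
    strings.selectExpr("count + 1").explain()   // plan shows cast(count as double)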

Because of these rules, the max case and the mean / stddev case are not equivalent at all:

  • max is meaningful for strings (it uses lexicographic ordering), so no coercion is needed:

    Seq.empty[String].toDF("count").agg(max("count")).explain
    
    == Physical Plan ==
    SortAggregate(key=[], functions=[max(count#69)])
    +- Exchange SinglePartition
       +- SortAggregate(key=[], functions=[partial_max(count#69)])
          +- LocalTableScan <empty>, [count#69]
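    This lexicographic ordering is also what answers question 2: as strings, "986" > "370002" because '9' > '3'. A minimal sketch (assuming spark.implicits._ and org.apache.spark.sql.functions.max are in scope; the output is illustrative):

    // max on the string column picks "986", not "370002"
    Seq("370002", "986", "1").toDF("count").agg(max("count")).show()

    // casting to a numeric type restores numeric ordering and returns 370002
    Seq("370002", "986", "1").toDF("count").agg(max($"count".cast("long"))).show()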
    
  • there is no mean or standard deviation for strings, so the argument is coerced (cast) to double:

    Seq.empty[String].toDF("count").agg(mean("count")).explain
    
    == Physical Plan ==
    *(2) HashAggregate(keys=[], functions=[avg(cast(count#81 as double))])
    +- Exchange SinglePartition
       +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(count#81 as double))])
          +- LocalTableScan <empty>, [count#81]
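    A minimal sketch of that behaviour (same assumptions as above, with mean from org.apache.spark.sql.functions): avg is computed over cast(count as double), which is why describe() still reports a mean and stddev for the string column.

    // (1 + 2 + 370002) / 3 = 123335.0, computed after casting the strings to double
    Seq("1", "2", "370002").toDF("count").agg(mean("count")).show()

    // a value that cannot be cast becomes null and is ignored by avg
    Seq("1", "2", "not a number").toDF("count").agg(mean("count")).show()   // returns 1.5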