I am reading a pyspark-sql script in a textbook, shown below:
dff = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter","\t").load("/home/hduser/Desktop/allFromDesktop/pyspark/creditData.csv")
Result:
+---+------------------+-----+------+-----+---+---------+------+-------+-------+----------------+-------+---------+
|_c0| Income|Limit|Rating|Cards|Age|Education|Gender|Student|Married| Ethnicity|Balance|Age_class|
+---+------------------+-----+------+-----+---+---------+------+-------+-------+----------------+-------+---------+
| 0|14.890999999999998| 3606| 283| 2| 34| 11| Male| No| Yes| Caucasian| 333| 25-34|
| 1| 106.025| 6645| 483| 3| 82| 15|Female| Yes| Yes| Asian| 903| 65+|
| 2|104.59299999999999| 7075| 514| 4| 71| 11| Male| No| No| Asian| 580| 65+|
It is followed by this script:
tab = (
    dff.select(['Age_class', 'Balance', 'Limit'])
    .groupby('Age_class', 'Limit')
    .agg(
        F.count('Limit'),
        F.mean('Limit').alias('Limit_avg'),
        F.min('Limit').alias('Limit_min'),
        F.max('Limit').alias('Limit_max'),
    )
    .withColumn('total', sum(col('Limit')).over(Window))
    .withColumn('Percent', col('Limit') * 100 / col('total'))
    .drop(col('total'))
)
tab.show()
which produced:
+---------+-----+------------+---------+---------+---------+-------------------+
|Age_class|Limit|count(Limit)|Limit_avg|Limit_min|Limit_max| Percent|
+---------+-----+------------+---------+---------+---------+-------------------+
| 45-54| 7838| 1| 7838.0| 7838| 7838| 0.4137807247233719|
| 35-44| 886| 1| 886.0| 886| 886|0.04677337612974069|
| 45-54| 4632| 1| 4632.0| 4632| 4632| 0.244530788073317|
| 55-64| 1448| 1| 1448.0| 1448| 1448|0.07644226708336853|
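In this result you can see that the column 'Age_class' is not binned/grouped into classes. The data type of "Age_class" is String.
To clarify: the 'Limit' column was not originally in the groupby clause. When I ran the script as written, I got an error saying that the variable 'Limit' could not be resolved, and the original 'Limit' column had also been replaced/removed by .alias, so in the end I used groupby('Age_class','Limit'). With that change the script runs, but in the final result 'Age_class' is not properly binned/grouped. I expected it to be grouped into classes like this: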
Expected "Age_class" column
+---------+------------+------------------+---------+---------+
|Age_class|count(Limit)| Limit_avg|Limit_min|Limit_max|
+---------+------------+------------------+---------+---------+
| 45-54| 65| 4836.630769230769| 855| 11200|
| <25| 11|3932.6363636363635| 2120| 6375|
| 55-64| 68| 4530.0| 1311| 11966|
| 35-44| 71| 4884.140845070423| 886| 13414|
| 25-34| 45| 4280.0| 855| 8117|
| 65+| 140| 4922.757142857143| 1134| 13913|
+---------+------------+------------------+---------+---------+
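To illustrate the difference between the two groupings, here is a rough sketch in plain Python (not Spark). It uses only the three sample rows shown in the first result, and the helper name `summarize` is made up for this illustration. It assumes that grouping works the same way conceptually: one output row per distinct key.

```python
from collections import defaultdict

# (Age_class, Limit) pairs taken from the three sample rows shown above.
rows = [("25-34", 3606), ("65+", 6645), ("65+", 7075)]


def summarize(rows, key):
    """Group the Limit values by key(row) and compute count/avg/min/max per group.

    Hypothetical helper for illustration only; it mimics the shape of
    groupby(...).agg(count, mean, min, max), not Spark's implementation.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[1])
    return {
        k: {
            "count": len(v),
            "avg": sum(v) / len(v),
            "min": min(v),
            "max": max(v),
        }
        for k, v in groups.items()
    }


# Grouping on both columns: every distinct (Age_class, Limit) pair is its own group.
by_pair = summarize(rows, key=lambda r: (r[0], r[1]))

# Grouping on Age_class alone: rows with the same class collapse into one group.
by_class = summarize(rows, key=lambda r: r[0])

for k, stats in by_pair.items():
    print(k, stats)
for k, stats in by_class.items():
    print(k, stats)
```

With both columns as the key, each group holds a single row, which matches the count(Limit) of 1 and the identical avg/min/max values in my actual output. With 'Age_class' alone as the key, the two "65+" rows collapse into one group, which is the shape of the expected table. In the Spark script, though, grouping by 'Age_class' alone means 'Limit' no longer exists as a column after agg, so the later col('Limit') references would presumably have to point at one of the aggregated columns instead.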