PySpark SQL unable to bin the Age column of a DataFrame

Asked: 2019-05-16 18:10:45

Tags: sql pyspark-sql

I am working through a pyspark-sql script from a textbook, shown below:

dff = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter","\t").load("/home/hduser/Desktop/allFromDesktop/pyspark/creditData.csv")

Result:
+---+------------------+-----+------+-----+---+---------+------+-------+-------+----------------+-------+---------+
|_c0|            Income|Limit|Rating|Cards|Age|Education|Gender|Student|Married|       Ethnicity|Balance|Age_class|
+---+------------------+-----+------+-----+---+---------+------+-------+-------+----------------+-------+---------+
|  0|14.890999999999998| 3606|   283|    2| 34|       11|  Male|     No|    Yes|       Caucasian|    333|    25-34|
|  1|           106.025| 6645|   483|    3| 82|       15|Female|    Yes|    Yes|           Asian|    903|      65+|
|  2|104.59299999999999| 7075|   514|    4| 71|       11|  Male|     No|     No|           Asian|    580|      65+|

It is followed by this script:

tab = (dff.select(['Age_class', 'Balance', 'Limit'])
          .groupby('Age_class', 'Limit')
          .agg(F.count('Limit'),
               F.mean('Limit').alias('Limit_avg'),
               F.min('Limit').alias('Limit_min'),
               F.max('Limit').alias('Limit_max'))
          .withColumn('total', sum(col('Limit')).over(Window))
          .withColumn('Percent', col('Limit') * 100 / col('total'))
          .drop(col('total')))
tab.show()

which produced:

+---------+-----+------------+---------+---------+---------+-------------------+
|Age_class|Limit|count(Limit)|Limit_avg|Limit_min|Limit_max|            Percent|
+---------+-----+------------+---------+---------+---------+-------------------+
|    45-54| 7838|           1|   7838.0|     7838|     7838| 0.4137807247233719|
|    35-44|  886|           1|    886.0|      886|      886|0.04677337612974069|
|    45-54| 4632|           1|   4632.0|     4632|     4632|  0.244530788073317|
|    55-64| 1448|           1|   1448.0|     1448|     1448|0.07644226708336853|
In this result you can see that the 'Age_class' column is not binned/grouped into classes. The data type of "Age_class" is String.
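
A minimal plain-Python sketch (no Spark involved) of why the groups in the output above contain one row each: when 'Limit' is part of the grouping key, every distinct (Age_class, Limit) pair forms its own group. The rows are the three sample rows from the first table; the helper name `summarize` is hypothetical and used only for illustration.

```python
from collections import defaultdict
from statistics import mean

# (Age_class, Limit) pairs copied from the three sample rows shown earlier
rows = [("25-34", 3606), ("65+", 6645), ("65+", 7075)]


def summarize(rows, key):
    """Group rows by key(row) and compute count/avg/min/max of Limit per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[1])
    return {
        k: {"count": len(v), "avg": mean(v), "min": min(v), "max": max(v)}
        for k, v in groups.items()
    }


# Grouping by Age_class only: rows that share a class are merged
by_class = summarize(rows, key=lambda r: r[0])

# Grouping by (Age_class, Limit): each distinct pair is its own group
by_class_and_limit = summarize(rows, key=lambda r: (r[0], r[1]))

print(by_class)
print(by_class_and_limit)
```

Even with three rows the effect is visible: grouping by class merges the two 65+ rows, while grouping by both keys keeps all three rows apart.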

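I want to clarify one point here: the "Limit" column was not originally in the groupby clause. When I ran the script in that form, I got an error saying that the variable "Limit" could not be resolved, and the original "Limit" column had also been replaced/dropped by .alias, so in the end I used groupby('Age_class','Limit'). With that change the script runs, but in the final result "Age_class" is not binned/grouped correctly. I expected it to be grouped into classes like this: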

Expected "Age_class" column
    +---------+------------+------------------+---------+---------+
    |Age_class|count(Limit)|         Limit_avg|Limit_min|Limit_max|
    +---------+------------+------------------+---------+---------+
    |    45-54|          65| 4836.630769230769|      855|    11200|
    |      <25|          11|3932.6363636363635|     2120|     6375|
    |    55-64|          68|            4530.0|     1311|    11966|
    |    35-44|          71| 4884.140845070423|      886|    13414|
    |    25-34|          45|            4280.0|      855|     8117|
    |      65+|         140| 4922.757142857143|     1134|    13913|
    +---------+------------+------------------+---------+---------+

0 answers:

No answers yet