PySpark: generating a new column from interval values of a column to bucket the data

Time: 2016-10-22 04:09:13

Tags: pyspark intervals

I have a table like this:

+------+------------+
| fruit|fruit_number|
+------+------------+
| apple|          20|
|orange|          33|
|  pear|          27|
| melon|          31|
|  plum|           8|
|banana|           4|
+------+------------+

I want to generate a table like this:

    |fruit_number_range|  number of types of fruit|
    |less than 5       |   1                      |
    |less than 25      |   3                      |
    |more than 25      |   2                      |

Is there a way to generate a new column based on which interval a column's value falls into?

Here is the code I used to generate the fruit table:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext, Row
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, LongType
from pyspark.sql.functions import sum, mean, col

# `sc` is assumed to be an existing SparkContext (for example, the one the pyspark shell provides)
sqlContext = HiveContext(sc)


rdd = sc.parallelize([('apple', 20),
('orange',33),
('pear',27),
('melon',31),
('plum',8),
('banana',4)])
schema = StructType([StructField('fruit', StringType(), True),
             StructField('fruit_number', IntegerType(),True)])
df = sqlContext.createDataFrame(rdd, schema)

1 answer:

Answer 0: (score: 1)

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import udf

sc = SparkContext()
sqlContext = HiveContext(sc)

rdd = sc.parallelize([('apple', 20),
                      ('orange',33),
                      ('pear',27),
                      ('melon',31),
                      ('plum',8),
                      ('banana',4)])
schema = StructType([StructField('fruit', StringType(), True),
                     StructField('fruit_number', IntegerType(),True)])

df = sqlContext.createDataFrame(rdd, schema)

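def fruit_number_range(fruit_number):
    # Map a count to its bucket label; values of 25 and above fall into the last bucket
    if fruit_number < 5:
        return 'less than 5'
    elif fruit_number < 25:
        return 'less than 25'
    return 'more than 25'

# Wrap the Python function as a UDF that returns a string column
udf_fruit_number_range = udf(fruit_number_range, StringType())
# Add the bucket label as a new column
df_w_range = df.withColumn("fruit_number_range", udf_fruit_number_range("fruit_number"))

# Count the rows in each bucket
df_w_range.groupBy("fruit_number_range").count().show()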

Result:

+------------------+-----+
|fruit_number_range|count|
+------------------+-----+
|      less than 25|    2|
|       less than 5|    1|
|      more than 25|    3|
+------------------+-----+
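
The counts above can be cross-checked without a cluster. The following is a minimal sketch and not part of the original answer: it reproduces the same bucketing in plain Python with `bisect`, and then shows how the same rule might be written as a built-in Spark column expression using `when`/`otherwise` instead of a Python UDF. The names `EDGES`, `LABELS`, `label_for`, and `with_range_column` are made up for illustration, and the Spark function assumes a DataFrame with an integer `fruit_number` column like the one built above.

```python
from bisect import bisect_right
from collections import Counter

# Exclusive upper bounds of every bucket except the last, in ascending order.
EDGES = [5, 25]
LABELS = ["less than 5", "less than 25", "more than 25"]


def label_for(fruit_number):
    """Return the bucket label; a value equal to an edge goes to the next bucket."""
    return LABELS[bisect_right(EDGES, fruit_number)]


fruits = [("apple", 20), ("orange", 33), ("pear", 27),
          ("melon", 31), ("plum", 8), ("banana", 4)]

# Same aggregation as groupBy("fruit_number_range").count(), done locally.
counts = Counter(label_for(n) for _, n in fruits)
for label in LABELS:
    print(label, counts[label])


def with_range_column(df):
    """Spark variant (hypothetical helper): same buckets as a column expression."""
    # Imported lazily so the plain-Python part above runs without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(
        "fruit_number_range",
        F.when(F.col("fruit_number") < 5, "less than 5")
         .when(F.col("fruit_number") < 25, "less than 25")
         .otherwise("more than 25"),
    )
```

For this data the local check gives 1, 2, and 3 rows for the three buckets, which agrees with the answer's output rather than with the 3 and 2 sketched in the question. With a live session, `with_range_column(df).groupBy("fruit_number_range").count().show()` would be the equivalent of the UDF version; a column expression is usually preferable because the comparison runs inside Spark rather than in a Python worker.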