I have a table like this:
+------+------------+
| fruit|fruit_number|
+------+------------+
| apple|          20|
|orange|          33|
|  pear|          27|
| melon|          31|
|  plum|           8|
|banana|           4|
+------+------------+
I want to generate a table like this:
|fruit_number_range|number of types of fruit|
|less than 5       |1                       |
|less than 25      |3                       |
|more than 25      |2                       |
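I would like to know whether there is a way to generate a new column based on value intervals of an existing column.
Here is the code I used to create the fruit table: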
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext, Row
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, LongType
from pyspark.sql.functions import sum, mean, col

# Assumes `sc` is an existing SparkContext (e.g. the one provided by the pyspark shell).
sqlContext = HiveContext(sc)

rdd = sc.parallelize([('apple', 20),
                      ('orange', 33),
                      ('pear', 27),
                      ('melon', 31),
                      ('plum', 8),
                      ('banana', 4)])
schema = StructType([StructField('fruit', StringType(), True),
                     StructField('fruit_number', IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
Answer 0 (score: 1)
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext, Row
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, LongType
from pyspark.sql.functions import sum, mean, col, udf

sc = SparkContext()
sqlContext = HiveContext(sc)

rdd = sc.parallelize([('apple', 20),
                      ('orange', 33),
                      ('pear', 27),
                      ('melon', 31),
                      ('plum', 8),
                      ('banana', 4)])
schema = StructType([StructField('fruit', StringType(), True),
                     StructField('fruit_number', IntegerType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
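
# Map a fruit count to the label of the interval it falls into.
def fruit_number_range(fruit_number):
    if fruit_number < 5:
        return 'less than 5'
    elif fruit_number < 25:
        return 'less than 25'
    return 'more than 25'

# Wrap the Python function as a UDF that returns a string column.
udf_fruit_number_range = udf(fruit_number_range, StringType())

# Add the interval label as a new column, then count the rows per label.
df_w_range = df.withColumn("fruit_number_range", udf_fruit_number_range("fruit_number"))
df_w_range.groupBy("fruit_number_range").count().show()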
Result:
+------------------+-----+
|fruit_number_range|count|
+------------------+-----+
|      less than 25|    2|
|       less than 5|    1|
|      more than 25|    3|
+------------------+-----+
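
The binning rule itself can be checked without Spark. Below is a minimal plain-Python sketch, assuming the same sample data as the fruit table; the names `fruits` and `counts` are illustrative and not part of the original code. It applies the same function and tallies the labels with `collections.Counter`. The bins are mutually exclusive, so 'less than 25' only covers values from 5 to 24, and a value of exactly 25 ends up under 'more than 25'.

```python
from collections import Counter

# Same sample data as the fruit table (illustrative local copy, no Spark needed).
fruits = [('apple', 20), ('orange', 33), ('pear', 27),
          ('melon', 31), ('plum', 8), ('banana', 4)]


def fruit_number_range(fruit_number):
    """Return the label of the interval a count falls into (same rule as the UDF)."""
    if fruit_number < 5:
        return 'less than 5'
    elif fruit_number < 25:
        return 'less than 25'
    return 'more than 25'


# Tally how many fruits fall into each interval.
counts = Counter(fruit_number_range(number) for _, number in fruits)

for label, count in sorted(counts.items()):
    print(label, count)
```

If cumulative counts are wanted instead (for example, everything below 25 including the values below 5), the per-bin counts would need to be summed afterwards; the function as written does not do that.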