Group words by length using pyspark

Time: 2019-02-17 10:12:53

Tags: pyspark

I want to group my data by word length using pyspark.

a = sc.parallelize(("number", "algebra", "int", "str", "raj"))

The desired output format is

(("int","str","raj"),("number"),("algebra"))

1 Answer:

Answer 0 (score: 0)

a = sc.parallelize(("number", "algebra", "int", "str", "raj"))
a.collect()
    ['number', 'algebra', 'int', 'str', 'raj']

Now, perform the following steps to get the final output:

# Creating a tuple of the length of the word and the word itself.
a = a.map(lambda x: (len(x), x))
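# Illustrative intermediate check (not part of the original answer):
# each element is now a (length, word) pair.
a.collect()
    [(6, 'number'), (7, 'algebra'), (3, 'int'), (3, 'str'), (3, 'raj')]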

# Grouping by key (the word length), then keeping only the grouped words.
a = a.groupByKey().mapValues(list).values()
a.collect()
    [['int', 'str', 'raj'], ['number'], ['algebra']]
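
The same result can be written more compactly with groupBy, which keys each element by a function. This is a minimal alternative sketch, not part of the original answer; groupBy, mapValues, and values are standard PySpark RDD methods, and sc is the same SparkContext as above:

# Group words directly by their length, then drop the length keys.
b = sc.parallelize(("number", "algebra", "int", "str", "raj"))
b = b.groupBy(len).mapValues(list).values()
b.collect()
    [['number'], ['algebra'], ['int', 'str', 'raj']]

Note that the order of the groups returned by collect() depends on partitioning and may differ between runs.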