pyspark how to reduce by a list of keys

Date: 2017-11-27 09:04:49

Tags: python apache-spark pyspark spark-dataframe user-defined-functions

My program needs to reduce a list of lists by key. The codelist column may be large, so I first collapse consecutive duplicates into a column code_deDup, and now I need to aggregate the counts across that whole list again, by key. I can do the job with def kvListSum, but it is not elegant and not in the Python/PySpark spirit.

This question differs from "PySpark reduceByKey aggregation after collect_list on a column": that one is about producing the column code_deDup, whereas this one is about summing the counts inside the code_deDup list, which is a further step. Thanks.

My code:

from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import (ArrayType, StructType, StructField,
                               StringType, IntegerType)

def code_deDup(codelist):
    # collapse consecutive duplicates into [value, run_length] pairs
    temp_list = []
    for item in codelist:
        if temp_list and temp_list[-1][0] == item:
            temp_list[-1][1] += 1
        else:
            temp_list.append([item, 1])
    return temp_list

def kvListSum(mylist):
    # aggregate [key, count] pairs by key, keeping first-occurrence order
    temp_list = []
    for item in mylist:
        for temp in temp_list:
            if item[0] == temp[0]:
                temp[1] += item[1]
                break
        else:
            # copy into a plain list: struct elements arrive as immutable Rows
            temp_list.append(list(item))
    return temp_list

schema = ArrayType(StructType([
    StructField("char", StringType(), False),
    StructField("count", IntegerType(), False),
]))
code_deDup_udf = udf(code_deDup, schema)
kvListSum_udf = udf(kvListSum, schema)  # same element type after aggregation
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
rdd = sc.parallelize([
    ('20171127', ['AR','AR','CR','CR','BR','BR','AR','AR','CR','CR']),
    ('20171128', ['CR','CR','BR','BR','AR','AR','CR','CR','BR','BR']),
])
df = spark.createDataFrame(rdd, ["date", "codelist"])
df = df.withColumn('code_deDup', code_deDup_udf('codelist')) \
       .withColumn('codeSum', kvListSum_udf('code_deDup'))
df.show(2, False)
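As an aside, code_deDup is run-length encoding, which itertools.groupby expresses directly. A minimal pure-Python sketch (the name code_deDup_groupby is mine, for illustration):

```python
from itertools import groupby

def code_deDup_groupby(codelist):
    # groupby clusters consecutive equal values; count each run's length
    return [[key, sum(1 for _ in group)] for key, group in groupby(codelist)]
```

It is a drop-in replacement for the loop version and can be registered with the same udf schema.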

The output is:

+--------+----------------------------------------+----------------------------------------+
|date    |codelist                                |code_deDup                              |
+--------+----------------------------------------+----------------------------------------+
|20171127|[AR, AR, CR, CR, BR, BR, AR, AR, CR, CR]|[[AR,2], [CR,2], [BR,2], [AR,2], [CR,2]]|
|20171128|[CR, CR, BR, BR, AR, AR, CR, CR, BR, BR]|[[CR,2], [BR,2], [AR,2], [CR,2], [BR,2]]|
+--------+----------------------------------------+----------------------------------------+

I want a new column that further aggregates 'code_deDup':

[[AR,2], [CR,2], [BR,2], [AR,2], [CR,2]] --->  [AR, 4],[CR,4],[BR,2]
[[CR,2], [BR,2], [AR,2], [CR,2], [BR,2]] --->  [CR, 4],[BR,4],[AR,2]

I can do the job with def kvListSum, but it is not elegant, so please share your ideas.

0 answers:

There are no answers yet