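My program needs to reduce a list of lists by key. The codelist column can be very large, so I first reduce it to the column code_deDup, which collapses consecutive duplicates into [code, count] pairs. Now I need to sum those counts again by key over that list. I can do the job with def kvListSum, but it is not elegant and not in the Python / PySpark spirit.
This question is different from "PySpark reduceByKey aggregation after collect_list on a column": that one is comparable to obtaining the column code_deDup, whereas this one is about getting the per-key counts from the code_deDup list, which is a further step. Thanks.
My code: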
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, IntegerType
from collections import Counter
def code_deDup(codelist):
    temp_list = []
    for item in codelist:
        if len(temp_list) > 0:
            if temp_list[-1][0] != item:
                temp_list.append([item,1])
            else:
                temp_list[-1][1] += 1
        else:
            temp_list.append([item,1])
    return temp_list
def kvListSum(mylist):
    #cnt=Counter()
    #for key,value in mylist:
    #    cnt[key] += value
    #return cnt
    #return mylist[0]
    temp_list = []
    for item in mylist:
        boolfound = False
        for temp in temp_list:
            if item[0] == temp[0]:
                temp[1] += item[1]
                boolfound = True
                break
        if not boolfound:
            # Copy into a mutable list: struct elements arrive as Row
            # objects, which are tuples and cannot be updated in place.
            temp_list.append([item[0], item[1]])
    return temp_list
# Each array element is a (char, count) struct.
schema = ArrayType(StructType([
    StructField("char", StringType(), False),
    StructField("count", IntegerType(), False)]))
code_deDup_udf = udf(code_deDup, schema)
# Without an explicit return type, udf() defaults to StringType.
udf2 = udf(kvListSum, schema)
conf = SparkConf().setMaster("local")
conf = conf.setAppName("test")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
rdd = sc.parallelize([
    ('20171127', ['AR', 'AR', 'CR', 'CR', 'BR', 'BR', 'AR', 'AR', 'CR', 'CR']),
    ('20171128', ['CR', 'CR', 'BR', 'BR', 'AR', 'AR', 'CR', 'CR', 'BR', 'BR'])])
df = spark.createDataFrame(rdd, ["date", "codelist"])
df = df.withColumn('code_deDup', code_deDup_udf('codelist')).withColumn('codeSum', udf2('code_deDup'))
#df.printSchema()
df.show(2, False)
The output is:
+--------+-------------------------------+------------------------------------+
|date    |codelist                       |code_deDup                          |
+--------+-------------------------------+------------------------------------+
|20171127|[AR,AR,CR,CR,BR,BR,AR,AR,CR,CR]|[[AR,2],[CR,2],[BR,2],[AR,2],[CR,2]]|
|20171128|[CR,CR,BR,BR,AR,AR,CR,CR,BR,BR]|[[CR,2],[BR,2],[AR,2],[CR,2],[BR,2]]|
+--------+-------------------------------+------------------------------------+
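As a cross-check on the code_deDup column, the run-length step can be reproduced outside Spark with itertools.groupby. This is only a sketch in plain Python; the name run_length_encode is hypothetical and not part of my program.

```python
from itertools import groupby


def run_length_encode(codes):
    """Collapse each run of consecutive equal codes into [code, run_length]."""
    # groupby yields one group per run of adjacent equal items,
    # so a code that reappears later starts a new pair.
    return [[code, sum(1 for _ in run)] for code, run in groupby(codes)]


first_row = ['AR', 'AR', 'CR', 'CR', 'BR', 'BR', 'AR', 'AR', 'CR', 'CR']
encoded = run_length_encode(first_row)
print(encoded)
```

For the first row this gives the same five pairs as the code_deDup column for 20171127.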
I would like a new column that aggregates 'code_deDup' one step further, summing the counts per code:
[[AR,2],[CR,2],[BR,2],[AR,2],[CR,2]] ---> [[AR,4],[CR,4],[BR,2]]
[[CR,2],[BR,2],[AR,2],[CR,2],[BR,2]] ---> [[CR,4],[BR,4],[AR,2]]
I can do the job with def kvListSum, but it is not elegant, so I would appreciate your ideas.
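To make the target behavior concrete, here is a minimal sketch of the per-key sum in plain Python, using a dict to accumulate totals. It assumes Python 3.7 or later, where dicts preserve insertion order, so keys come out in first-seen order. The name kv_list_sum is hypothetical, and the sketch is not tested inside Spark.

```python
def kv_list_sum(pairs):
    """Sum the counts per key, keeping keys in first-seen order."""
    totals = {}
    for key, value in pairs:
        # dict.get supplies 0 the first time a key is seen.
        totals[key] = totals.get(key, 0) + value
    return [[key, total] for key, total in totals.items()]


deduped = [['AR', 2], ['CR', 2], ['BR', 2], ['AR', 2], ['CR', 2]]
summed = kv_list_sum(deduped)
print(summed)
```

Because the loop only unpacks each pair and never modifies it, the same function should also accept immutable pairs such as tuples. Wrapping it with udf and the existing schema would presumably work the same way as for code_deDup, but that part is an assumption.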