Python (PySpark) nested list reduceByKey, Python list append to create nested list

Date: 2018-12-09 20:41:56

Tags: python list pyspark nested

I have an RDD input with the following format:


[('2002', ['cougar', 1]),
 ('2002', ['the', 10]),
 ('2002', ['network', 4]),
 ('2002', ['is', 1]),
 ('2002', ['database', 13])]

'2002' is the key, so I have key-value pairs of:

('year', ['word', count])

The count is an integer, and I want to use reduceByKey to get the following result:

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]

I struggled a lot to get this nested list; the main problem is building the nested list itself. For example, suppose I have three lists a, b, and c:

a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

a.append(b)

returns a as

['cougar', 1, ['the', 10]]

while

x = []
x.append(a)
x.append(b)

returns x as

[['cougar', 1], ['the', 10]]

However,

c.append(x)

returns c as

['network', 4, [['cougar', 1], ['the', 10]]]

None of the above operations produces the result I want.

I want to get a nested list of the form:

[('2002', [[word1, c1], [word2, c2], [word3, c3], ...]),
 ('2003', [[w1, count1], [w2, count2], [w3, count3], ...])]

where the elements (like a, b, and c above) are themselves two-element lists.

I hope the question is clear. Any suggestions?

2 Answers:

Answer 0 (score: 1)

I came up with one solution:

def wagg(a, b):
    # Reduce two values for the same key. Each value is either a single
    # [word, count] pair or an already-merged list of such pairs; check
    # the first element to tell the two cases apart.
    if isinstance(a[0], list):
        if isinstance(b[0], list):
            a.extend(b)   # both are merged lists: concatenate them
        else:
            a.append(b)   # b is a single pair: append it to a
        return a
    elif isinstance(b[0], list):
        # a must be a single pair here (the first branch already covered
        # the case where a is a merged list), so just append it.
        b.append(a)
        return b
    else:
        # both are single pairs: start the nested list
        return [a, b]


rdd2 = rdd1.reduceByKey(wagg)
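
As a quick sanity check (an illustration, not part of the original answer), the snippet below builds rdd1 from the question's sample pairs; note that the inner order is not guaranteed, since Spark may combine values in any order:

rdd1 = sc.parallelize([('2002', ['cougar', 1]), ('2002', ['the', 10]),
                       ('2002', ['network', 4]), ('2002', ['is', 1]),
                       ('2002', ['database', 13])])
rdd1.reduceByKey(wagg).collect()
# [('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]

One caveat: if a key occurs exactly once, reduceByKey never calls wagg, so that key keeps a flat ['word', count] value instead of a nested list.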

Does anyone have a better solution?

Answer 1 (score: 1)

There is no need to use reduceByKey for this problem.

  • Define the RDD

rdd = sc.parallelize([('2002', ['cougar', 1]), ('2002', ['the', 10]),
                      ('2002', ['network', 4]), ('2002', ['is', 1]),
                      ('2002', ['database', 13])])

  • View the RDD values with rdd.collect()

[('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])]

  • Apply the groupByKey function and map the values to a list, as shown in the Apache Spark docs (see the note after the snippet below).

rdd_nested = rdd.groupByKey().mapValues(list)
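
The mapValues(list) step matters here: on its own, groupByKey yields a pyspark.resultiterable.ResultIterable per key rather than a plain list, roughly:

# Without mapValues(list), each grouped value is a ResultIterable,
# not a nested list (the object address below is illustrative):
rdd.groupByKey().collect()
# [('2002', <pyspark.resultiterable.ResultIterable object at 0x7f...>)]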

  • View the grouped RDD values with rdd_nested.collect()

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]
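
For completeness, a hedged alternative to both answers is aggregateByKey, which builds the per-key list without the type checks of the wagg approach; a minimal sketch, assuming the same sc and rdd as above:

# Start each key with an empty list, fold single [word, count] pairs
# in within a partition, and concatenate partial lists across partitions.
rdd_agg = rdd.aggregateByKey(
    [],                               # zero value: empty list per key
    lambda acc, v: acc + [v],         # add one [word, count] pair
    lambda acc1, acc2: acc1 + acc2    # merge two partial lists
)
rdd_agg.collect()
# [('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]
# (inner order may vary)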