Question

I have an RDD input in the following format:

[('2002', ['cougar', 1]),
 ('2002', ['the', 10]),
 ('2002', ['network', 4]),
 ('2002', ['is', 1]),
 ('2002', ['database', 13])]

'2002' is the key, so I have key-value pairs of the form:

('year', ['word', count])

count is an integer. I would like to use reduceByKey to get the following result:

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]

I am struggling to build a nested list like the one above; getting the nesting right is the main problem. For example, suppose I have three lists a, b and c:

a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

a.append(b)

returns a as

['cougar', 1, ['the', 10]]

and (starting again from the original a and b)

x = []
x.append(a)
x.append(b)

returns x as

[['cougar', 1], ['the', 10]]

However, if I then do

c.append(x)

c becomes

['network', 4, [['cougar', 1], ['the', 10]]]

None of the operations above gives me the result I want.

I want to get

[('2002', [[word1, c1], [word2, c2], [word3, c3], ...]),
 ('2003', [[w1, count1], [w2, count2], [w3, count3], ...])]

That is, the nested list should be:

[a, b, c]

where a, b and c are themselves lists of two elements.

I hope the question is clear. Any suggestions?
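The list behaviour described in the question can be reproduced in plain Python, without Spark. The sketch below is only an illustration using the question's own a, b and c; the variable names nested_in_a and pairs are made up for the example. It shows why appending one pair to another nests it, while appending each pair to an initially empty list yields the desired list of pairs.

```python
a = ['cougar', 1]
b = ['the', 10]
c = ['network', 4]

# append adds its argument as ONE element, so appending a pair to a pair
# nests the second pair inside the first. (a + [b] has the same contents
# as a.append(b) would give, but leaves a unchanged for the next step.)
nested_in_a = a + [b]
print(nested_in_a)   # ['cougar', 1, ['the', 10]]

# Starting from an empty list, each append contributes one whole pair,
# which produces a list of pairs.
pairs = []
for item in (a, b, c):
    pairs.append(item)
print(pairs)         # [['cougar', 1], ['the', 10], ['network', 4]]

# The list literal [a, b, c] builds the same structure directly.
print(pairs == [a, b, c])   # True
```

The key point is that the accumulator has to be a list of pairs from the start; a bare pair cannot serve as the accumulator.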

Answer 1

I came up with a solution:

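def wagg(a, b):
    # Each argument is either a single [word, count] pair or a list of such
    # pairs produced by an earlier merge; a[0] is a list only in the latter case.
    if isinstance(a[0], list):
        if isinstance(b[0], list):
            a.extend(b)   # both are lists of pairs: concatenate
        else:
            a.append(b)   # b is a single pair: add it to a
        return a
    if isinstance(b[0], list):
        b.append(a)       # a is a single pair: add it to b
        return b
    return [a, b]         # two single pairs: start a new list of pairs


# A key with only one value never reaches wagg, so its value stays a flat pair.
rdd2 = rdd1.reduceByKey(wagg)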

Does anyone have a better solution?
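One possible simplification, offered as a sketch rather than a tested Spark job: if every value is wrapped in a one-element list before reducing, the merge function no longer needs to inspect types, because both arguments are always lists of pairs. The example below imitates the pairwise merging for a single key with functools.reduce so that it runs without Spark; the '2003' value is invented for illustration. The PySpark call in the comment uses mapValues and reduceByKey and assumes rdd1 is the RDD from the question.

```python
from functools import reduce

# Values that share the key '2002' in the question's data.
values_2002 = [['cougar', 1], ['the', 10], ['network', 4]]
# An invented key with a single value, to show the shape stays consistent.
values_2003 = [['spark', 2]]


def merge_wrapped(values):
    # Wrap each pair in its own list, then merge pairwise by concatenation.
    wrapped = [[pair] for pair in values]
    return reduce(lambda left, right: left + right, wrapped)


merged_2002 = merge_wrapped(values_2002)
merged_2003 = merge_wrapped(values_2003)
print(merged_2002)   # [['cougar', 1], ['the', 10], ['network', 4]]
print(merged_2003)   # [['spark', 2]]

# Assumed PySpark equivalent (not executed here):
# rdd2 = rdd1.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)
```

Because the wrapping happens before the reduction, a key with only one value still ends up as a list containing one pair.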

Answer 2

You do not need reduceByKey for this problem.

Define the RDD:

rdd = sc.parallelize([('2002', ['cougar', 1]),('2002', ['the', 10]),('2002', ['network', 4]),('2002', ['is', 1]),('2002', ['database', 13])])

Inspect the RDD values with rdd.collect():

[('2002', ['cougar', 1]), ('2002', ['the', 10]), ('2002', ['network', 4]), ('2002', ['is', 1]), ('2002', ['database', 13])]

Apply the groupByKey function and map the grouped values to lists, as shown in the Apache Spark docs.

rdd_nested = rdd.groupByKey().mapValues(list)

Inspect the grouped RDD values with rdd_nested.collect():

[('2002', [['cougar', 1], ['the', 10], ['network', 4], ['is', 1], ['database', 13]])]
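To make the grouping step concrete, here is a rough plain-Python picture of what grouping by key and turning each group into a list does. It is not Spark code: group_values is a hypothetical local helper, and the '2003' records are invented so that more than one key appears. Spark itself does not guarantee the order of keys or of values within a group; the helper sorts keys only to make the printed result stable.

```python
from collections import defaultdict


def group_values(pairs):
    """Hypothetical local helper: collect the values of each key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Keys are unique, so sorting the items orders them by key.
    return sorted(grouped.items())


records = [
    ('2002', ['cougar', 1]),
    ('2003', ['spark', 2]),      # invented record for illustration
    ('2002', ['the', 10]),
    ('2002', ['network', 4]),
    ('2003', ['python', 7]),     # invented record for illustration
]

nested = group_values(records)
print(nested)
# [('2002', [['cougar', 1], ['the', 10], ['network', 4]]),
#  ('2003', [['spark', 2], ['python', 7]])]
```

Each key ends up with a list of two-element lists, which is the shape requested in the question.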

Python (PySpark) nested list with reduceByKey: appending Python lists to create a nested list
