Question

I have a pair RDD holding 1 TB of records. I want to group all my records by key and then apply a function to the values only.

My code is as follows:

rdd = sc.textFile("path").map(lambda l: l.split(";"))
rdd_pair=rdd.map(lambda a: (a[0], a))
rdd_pair.take(3)
#output: [('id_client', ('id_client','time','city'))]
#[('1', [('1', '2013/03/12 23:59:59', 'London')])]
#[('1', [('1', '2013/12/03 10:43:12', 'Rome')])]
#[('1', [('1', '2013/05/01 00:09:59', 'Madrid')])]

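I want to group all records by id_client and then apply the function matrix to the values only. For each key, the function sorts the list of tuples by "time" and then extracts the transitions from one city to the next.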

grouped=rdd_pair.groupByKey(200)
grouped.take(1)
#output [("1", <pyspark.resultiterable.ResultIterable object at 0x7fc659e0a210>)]

def matrix(input):
    output=[]
    input_bag= sorted(input, key=lambda x: x[1], reverse=False)
    loc0 = input_bag[0]
    for loc in input_bag[1:]:
        output.append((loc0[2],loc[2]))
        loc0 = loc
    return output

transition=grouped.mapValues(lambda k: matrix(k)).filter(lambda l: l[1]!=[])

The output I want is:

#output transition: [('1', [('London', 'Madrid'),('Madrid', 'Rome')])]

I get a Python error: list index out of range.

Can anyone help me? Thanks.

Answer 1

I solved it this way:

def matrix(input):
    output=[]
    input2=[i[0] for i in input]
    input_bag= sorted(input2, key=lambda x: x[1], reverse=False)
    loc0 = input_bag[0]
    for loc in input_bag[1:]:
        output.append((loc0[2],loc[2]))
        loc0 = loc
    return output

Before calling the Python built-in function "sorted", I convert the input (an iterable object) into input2 (a list of tuples).
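For reference, here is a minimal sketch of the same idea that runs outside Spark. It assumes each grouped value is a one-element list wrapping an (id_client, time, city) tuple, as in the take(3) output in the question, and that the time strings are zero-padded "YYYY/MM/DD HH:MM:SS" so that sorting them as text is chronological. The name transitions and the plain iterator standing in for a ResultIterable are illustrative, not part of the original code.

```python
def transitions(values):
    """Return consecutive (city_from, city_to) pairs, ordered by time."""
    # Unwrap each one-element list and materialize the iterable into a
    # real list, so it can be sorted and sliced.
    records = [v[0] for v in values]
    # Assumption: zero-padded "YYYY/MM/DD HH:MM:SS" strings, so a plain
    # string sort is also a chronological sort.
    records.sort(key=lambda r: r[1])
    # Pair each record with its successor; groups with fewer than two
    # records yield an empty list instead of raising IndexError.
    return [(a[2], b[2]) for a, b in zip(records, records[1:])]


# A plain iterator stands in for pyspark's ResultIterable here.
group = iter([
    [("1", "2013/03/12 23:59:59", "London")],
    [("1", "2013/12/03 10:43:12", "Rome")],
    [("1", "2013/05/01 00:09:59", "Madrid")],
])
print(transitions(group))  # [('London', 'Madrid'), ('Madrid', 'Rome')]
```

In the Spark job this would presumably be passed as grouped.mapValues(transitions), followed by the same filter on empty lists as in the question.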

How to use a pyspark.resultiterable.ResultIterable object
