Is there a way to unpack a tuple nested inside a tuple in PySpark? The data looks like this:
[('123', '0001-01-01', '2500-01-01', (26, 'X', 'A', '4724', '4724')), ('123', '0001-01-01', '2500-01-01', (21, 'S', 'A', '8247', '8247'))]
I want it to look like:
[('123', '0001-01-01', '2500-01-01', 26, 'X', 'A', '4724', '4724'), ('123', '0001-01-01', '2500-01-01', 21, 'S', 'A', '8247', '8247')]
Answer 0 (score: 2)
def unpack(record):
    unpacked_list = []
    for obj in record:
        if isinstance(obj, tuple):
            for obj_elem in obj:
                unpacked_list.append(obj_elem)
        else:
            unpacked_list.append(obj)
    return tuple(unpacked_list)

example_rdd = example_rdd.map(unpack)
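Since map() simply applies the function to each record, unpack can be sanity-checked locally without a SparkContext. A minimal sketch on the sample data (the variable names here are illustrative):

```python
def unpack(record):
    # Flatten one level of nesting: splice any tuple element's
    # items into the result in place of the tuple itself.
    unpacked_list = []
    for obj in record:
        if isinstance(obj, tuple):
            for obj_elem in obj:
                unpacked_list.append(obj_elem)
        else:
            unpacked_list.append(obj)
    return tuple(unpacked_list)

data = [
    ('123', '0001-01-01', '2500-01-01', (26, 'X', 'A', '4724', '4724')),
    ('123', '0001-01-01', '2500-01-01', (21, 'S', 'A', '8247', '8247')),
]
flattened = [unpack(r) for r in data]
```

Unlike the slicing approaches below, this version does not assume the nested tuple is in any particular position.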
Answer 1 (score: 0)
Try this:
example = [('123', '0001-01-01', '2500-01-01', (26, 'X', 'A', '4724', '4724')), ('123', '0001-01-01', '2500-01-01', (21, 'S', 'A', '8247', '8247'))]
[x[:3] + x[3] for x in example]
Result:
[('123', '0001-01-01', '2500-01-01', 26, 'X', 'A', '4724', '4724'), ('123', '0001-01-01', '2500-01-01', 21, 'S', 'A', '8247', '8247')]
Answer 2 (score: 0)
As AChampion suggested in the comments, you can use map(lambda x: x[:-1] + x[-1]), like so:
data = sc.parallelize([
    ('123', '0001-01-01', '2500-01-01', (26, 'X', 'A', '4724', '4724')),
    ('123', '0001-01-01', '2500-01-01', (21, 'S', 'A', '8247', '8247'))
])
data.map(lambda x: x[:-1] + x[-1]).collect()
This gives:
[('123', '0001-01-01', '2500-01-01', 26, 'X', 'A', '4724', '4724'),
('123', '0001-01-01', '2500-01-01', 21, 'S', 'A', '8247', '8247')]
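The lambda relies only on plain tuple slicing and concatenation, so its behavior can be checked without Spark. A small sketch, with the caveat that this approach assumes the nested tuple is the last element of each record:

```python
# x[:-1] is everything except the last element (already a tuple),
# and x[-1] is the nested tuple; + concatenates the two tuples.
row = ('123', '0001-01-01', '2500-01-01', (26, 'X', 'A', '4724', '4724'))
flat = row[:-1] + row[-1]
```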