Concatenate two dictionaries and store the result in an RDD

Time: 2019-06-28 08:39:11

Tags: python apache-spark dictionary rdd

I have a dictionary users containing 1748 elements (only the first 12 are shown):

defaultdict(int, {'470520068': 1, '2176120173': 1, '145087572': 3, '23047147': 1, '526506000': 1, '326311693': 1, '851106379': 4, '161900469': 1, '3222966471': 1, '2562842034': 1, '18658617': 1, '73654065': 4,})

and another dictionary partition containing 452743 elements (the first 42 are shown):

{'609232972': 4, '975151075': 4, '14247572': 4, '2987788788': 4, '3064695250': 2, '54097674': 3, '510333371': 0, '34150587': 4, '26170001': 0, '1339755391': 3, '419536996': 4, '2558131184': 2, '23068646': 6, '2781517567': 3, '701206260771905541': 4, '754263126': 4, '33799684': 0, '1625984816': 4, '4893416104': 3, '263520530': 3, '60625681': 4, '470528618': 3, '4512063372': 6, '933683112': 3, '402379005': 4, '1015823005': 2, '244673821': 0, '3279677882': 4, '16206240': 4, '3243924564': 6, '2438275574': 6, '205941266': 3, '330723222': 1, '3037002897': 0, '75454729': 0, '3033154947': 6, '67475302': 3, '922914019': 6, '2598199242': 6, '2382444216': 3, '1388012203': 4, '3950452641': 5,}

The keys in users (all unique) all appear in partition, where they are repeated with different values (partition also contains some extra keys that we do not use). What I want is a new dictionary final that joins the keys of users with the matching keys of partition and takes its values from partition. That is, if I have '145087572' as a key in users, and the same key is repeated two or three times in partition with the values {'145087572':2,'145087572':3 ,'145087572':7}, then I should get all three of these elements in the new dictionary final. I also have to store this dictionary as a key:value RDD.
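Here is what I tried:

user_key = list(users.keys())
final = []
for x in user_key:
    s = {x: partition.get(x) for x in partition}
    final.append(s)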

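After running this code my laptop stopped responding (the cell still showed [*]) and I had to restart it. Is there something wrong with my code, and is there a more efficient way to do this?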

1 answer:

Answer 0: (score: 0)

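First of all, a dictionary cannot contain duplicate keys: when a key is repeated, its value is overwritten by the last value assigned to that key.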
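A minimal sketch of that overwrite behaviour, reusing the key and values from the question. The name pairs is only illustrative; it shows one way to keep every occurrence, namely a list of (key, value) pairs.

```python
# Repeating a key in a dict literal is legal, but only the last value survives.
d = {'145087572': 2, '145087572': 3, '145087572': 7}
print(d)  # {'145087572': 7}

# A list of (key, value) pairs keeps every occurrence of the key.
pairs = [('145087572', 2), ('145087572', 3), ('145087572', 7)]
print(len(d), len(pairs))  # 1 3
```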
Now let's analyse your code:

user_key = list(users.keys())  # all the keys of users, say 1, 2, 3
final = []
for x in user_key:  # the outer loop visits each key of users: 1, 2, 3
    # The next line is the reason for the hang. The comprehension has its own
    # loop variable, also named x, which shadows the outer x. Written out, it is:
    #     s = {}
    #     for x in partition:
    #         s[x] = partition.get(x)
    # So instead of looking up the current users key, every pass iterates over
    # all the keys of partition and builds a full copy of it.
    s = {x: partition.get(x) for x in partition}
    final.append(s)

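Now for the reason it hangs. Suppose the users dictionary has 10 keys: the outer loop runs 10 times, and on each pass the comprehension walks every key of partition and builds a full copy of it.
With the sizes in the question, that is 1748 copies of a 452743-entry dictionary, roughly 791 million entries in total, which exhausts memory and eventually makes the system hang.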
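The effect can be reproduced at a harmless scale. This is only a sketch with small stand-in dictionaries (the 3-key users and 1000-key partition below are made up for illustration), followed by the same arithmetic for the sizes given in the question.

```python
# Small stand-ins for the real data (hypothetical sizes: 3 and 1000 entries).
users = {'a': 1, 'b': 1, 'c': 3}
partition = {str(i): i % 7 for i in range(1000)}

final = []
for x in users:
    # The comprehension's x shadows the outer x, so s is a copy of partition.
    s = {x: partition.get(x) for x in partition}
    final.append(s)

total_entries = sum(len(s) for s in final)
print(total_entries)                       # 3000
print(all(s == partition for s in final))  # True

# The same product for the sizes in the question.
print(1748 * 452743)                       # 791394764
```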
I think the following will work for you.
Store the partition data in a Python defaultdict(list):

from collections import defaultdict

user_keys = list(users.keys())
part_dict = defaultdict(list)
# partition = [[key1, value], [key2, value], ...]
# Store your partition data this way (a list of pairs), because a dict
# cannot hold the same key more than once.
for key, value in partition:
    # defaultdict(list) creates the empty list the first time a key is seen
    part_dict[key].append(value)
# part_dict = {key1: [1, 2, 3], key2: [1, 2, 3], key3: [4, 5], ...}
final = []
for x in user_keys:
    # .get avoids inserting an empty list for a key missing from partition
    for value in part_dict.get(x, []):
        final.append([x, value])
        # If you want the result in dictionary form (I don't think it's required), use:
        # final.append({x: value})
        # final = [{key1: 1}, {key1: 2}, ...]
# final = [[key1, 1], [key1, 2], [key1, 3], ...]

The code above is untested and may need some minor changes.
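For the key:value RDD part of the question, one possible sketch follows. It assumes the partition data is available as a list of (key, value) pairs and that a SparkContext already exists (named sc here, as in a typical PySpark notebook); build_pairs and to_rdd are hypothetical helper names, not part of any library. Only the pure-Python part is exercised below, since to_rdd needs a running Spark session.

```python
from collections import defaultdict

def build_pairs(users, partition_pairs):
    """Return every (key, value) pair from partition_pairs whose key is in users."""
    grouped = defaultdict(list)
    for key, value in partition_pairs:
        grouped[key].append(value)
    # Iterate users so extra partition keys are dropped.
    return [(key, value) for key in users for value in grouped.get(key, [])]

def to_rdd(sc, pairs):
    """Turn the list of pairs into a key:value RDD (sc is an assumed SparkContext)."""
    return sc.parallelize(pairs)

# Tiny illustrative inputs; '999' stands for an extra key that is not in users.
users = {'145087572': 3, '23047147': 1}
partition_pairs = [('145087572', 2), ('999', 0), ('145087572', 3),
                   ('145087572', 7), ('23047147', 4)]
pairs = build_pairs(users, partition_pairs)
print(pairs)
# [('145087572', 2), ('145087572', 3), ('145087572', 7), ('23047147', 4)]
```

In a notebook where sc is defined, to_rdd(sc, pairs) would give an RDD of 2-tuples, which Spark treats as a pair RDD.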