I have a dictionary, users, containing 1748 elements (only the first 12 are shown):
defaultdict(int,
{'470520068': 1,
'2176120173': 1,
'145087572': 3,
'23047147': 1,
'526506000': 1,
'326311693': 1,
'851106379': 4,
'161900469': 1,
'3222966471': 1,
'2562842034': 1,
'18658617': 1,
'73654065': 4,})
and another dictionary, partition, containing 452743 elements (only the first 42 are shown):
{'609232972': 4,
'975151075': 4,
'14247572': 4,
'2987788788': 4,
'3064695250': 2,
'54097674': 3,
'510333371': 0,
'34150587': 4,
'26170001': 0,
'1339755391': 3,
'419536996': 4,
'2558131184': 2,
'23068646': 6,
'2781517567': 3,
'701206260771905541': 4,
'754263126': 4,
'33799684': 0,
'1625984816': 4,
'4893416104': 3,
'263520530': 3,
'60625681': 4,
'470528618': 3,
'4512063372': 6,
'933683112': 3,
'402379005': 4,
'1015823005': 2,
'244673821': 0,
'3279677882': 4,
'16206240': 4,
'3243924564': 6,
'2438275574': 6,
'205941266': 3,
'330723222': 1,
'3037002897': 0,
'75454729': 0,
'3033154947': 6,
'67475302': 3,
'922914019': 6,
'2598199242': 6,
'2382444216': 3,
'1388012203': 4,
'3950452641': 5,}
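The keys in users (all unique) are all present in partition, where they are repeated with different values (partition also contains some extra keys that we don't use). What I want is a new dictionary, final, that connects the keys of users with the matching keys and values of partition. For example, if '145087572' is a key in users and the same key is repeated two or three times in partition with the values {'145087572':2, '145087572':3, '145087572':7}, then I should get all three of these elements in the new dictionary final. I also have to store this dictionary as a key:value RDD.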
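Here's what I tried:
user_key = list(users.keys())
final = []
for x in user_key:
    s = {x: partition.get(x) for x in partition}
    final.append(s)
After running this code, my laptop stopped responding (the cell still showed [*]) and I had to restart it. Is there something wrong with my code, and is there a more efficient way to do this?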
Answer 0 (score: 0)
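First, a dictionary cannot contain duplicate keys: when a key is repeated, its value is overwritten by the last value given for that key.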
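A short sketch of this behaviour, using the key and values from the question's own example. The name pairs is only illustrative; a list of (key, value) tuples is one way to keep the repeats that a dictionary drops.

```python
# Writing the same key several times keeps only the last value.
d = {'145087572': 2, '145087572': 3, '145087572': 7}
print(d)  # only one entry survives, holding the last value (7)

# A list of (key, value) tuples can hold all three repeats.
pairs = [('145087572', 2), ('145087572', 3), ('145087572', 7)]
print(len(pairs))
```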
Now let's analyze your code:
user_key = list(users.keys())  # all the keys of users, say [1, 2, 3]
final = []
for x in user_key:  # iterating over the keys, so x will be 1, 2, 3
    # This line is the reason for the hang:
    s = {x: partition.get(x) for x in partition}
    # Written out as a plain loop, the line above is equivalent to:
    #     s = {}
    #     for x in partition:
    #         s[x] = partition.get(x)
    # The comprehension reuses the name x, which shadows the x of the
    # outer loop. So instead of using the current key of users, it
    # iterates over every key of partition, and s ends up as a full
    # copy of partition on each pass.
    final.append(s)
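Now, the reason for the hang: suppose the users dictionary has 10 keys. The outer for loop then runs 10 times, and on each pass the inner loop walks over every key of partition and copies it. With the sizes in the question, that is 1748 copies of a 452743-entry dictionary, roughly 791 million entries in total. This exhausts memory, and the system eventually hangs.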
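The effect can be reproduced at a small scale. The sketch below runs the same loop on two tiny made-up dictionaries (the sample keys and values are assumptions for illustration, not the question's data) and counts how many entries end up in final.

```python
# Hypothetical small stand-ins for the two dictionaries in the question.
users = {'a': 1, 'b': 3}
partition = {'a': 4, 'b': 0, 'c': 6, 'd': 2}

final = []
for x in users:
    # The comprehension's x shadows the loop's x, so every key of
    # partition is copied, not just the current key of users.
    s = {x: partition.get(x) for x in partition}
    final.append(s)

copies = len(final)                     # one copy per key of users
entries = sum(len(s) for s in final)    # len(users) * len(partition)
print(copies, entries)
```

The total grows as len(users) * len(partition), which is why the full-size dictionaries do not fit in memory.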
I think this will work for you.
Store the partition data in a Python defaultdict(list):
from collections import defaultdict

user_keys = list(users.keys())
part_dict = defaultdict(list)
# partition = [[key1, value], [key2, value], ...]
# Store your partition data this way (a list of [key, value] lists),
# so that repeated keys are not overwritten.
for index in partition:
    # defaultdict(list) creates the empty list on first access,
    # so no "key already present" check is needed
    part_dict[index[0]].append(index[1])
# part_dict = {key1: [1, 2, 3], key2: [1, 2, 3], key3: [4, 5], ...}
final = []
for x in user_keys:
    for value in part_dict[x]:
        final.append([x, value])
# final = [[key1, 1], [key1, 2], [key1, 3], ...]
# If you want the result as dictionaries (I don't think that's required), use
# final.append({x: value}) instead, which gives
# final = [{key1: 1}, {key1: 2}, ...]
The code above is untested and may need some minor changes.
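As a runnable illustration of the same grouping idea, here is a minimal sketch on made-up sample data. The names partition_pairs and join_on_user_keys are hypothetical, and it assumes the partition data is available as (key, value) pairs instead of a dictionary, so that repeated keys are preserved.

```python
from collections import defaultdict

# Hypothetical sample data; partition is kept as (key, value) pairs
# so that repeated keys survive.
users = {'145087572': 3, '73654065': 4}
partition_pairs = [
    ('145087572', 2), ('609232972', 4), ('145087572', 3),
    ('73654065', 4), ('145087572', 7),
]

def join_on_user_keys(users, partition_pairs):
    """Return every (key, value) pair of partition_pairs whose key is in users."""
    grouped = defaultdict(list)
    for key, value in partition_pairs:      # single pass over the partition data
        grouped[key].append(value)
    # .get avoids inserting empty lists for keys that never appear
    return [(key, value) for key in users for value in grouped.get(key, [])]

final = join_on_user_keys(users, partition_pairs)
print(final)
```

This makes one pass over the partition data and one over the keys of users, so nothing is copied per user. For the RDD part of the question, and assuming a SparkContext named sc is available, passing this list of tuples to sc.parallelize should give a key-value RDD.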