I have a dictionary whose keys are tuples of strings and whose values are frequencies, for example:
{('this','is'):2,('some','word'):3....}
I need to eliminate some keys that occur as sub-keys of other keys, for example:
d={('large','blue'):4,('cute','blue'):3,('large','blue','dog'):2,
('cute','blue','dog'):2,('cute','blue','elephant'):1}
I need to remove ('large','blue') because it occurs only in 'large blue dog', but I cannot remove ('cute','blue') because it occurs in both 'cute blue dog' and 'cute blue elephant'.
d={('large','blue'):4,('cute','blue'):3,('large','blue','dog'):2,
('cute','blue','dog'):2,('cute','blue','elephant'):1}
final_list=[]
for k,v in d.items():
    final_list.append(' '.join(f for f in k))
final_list=sorted(final_list, key=len,reverse=True)
completed=set()
for f in final_list:
    if not completed:
        completed.add(f)
    else:
        if sum(f in s for s in completed)==1:
            continue
print(final_list)
print(completed)
But this only gives me 'cute blue elephant'. What I need is:
[large blue dog] :2
[cute blue dog]:2
[cute blue elephant]:1
[cute blue]:3
Answer 0 (score: 1)
Update. If you want the counts as well, I would rather rewrite most of the code as:
d={('large','blue'):4,('cute','blue'):3,('large','blue','dog'):2,
('cute','blue','dog'):2,('cute','blue','elephant'):1}
completed = {}
for k,v in d.items():
    if len([k1 for k1,v1 in d.items() if k != k1 and set(k).issubset(set(k1))]) != 1:
        completed[k] = v
print(completed)
Result:
{('cute', 'blue'): 3, ('large', 'blue', 'dog'): 2, ('cute', 'blue', 'dog'): 2, ('cute', 'blue', 'elephant'): 1}
I have not checked the performance. I will leave that to you.
-
How about replacing
for f in final_list:
    if not completed:
        completed.add(f)
    else:
        if sum(f in s for s in completed)==1:
            continue
with
for f in final_list:
    if len([x for x in final_list if f != x and f in x]) != 1:
        completed.add(f)
Is this what you are looking for?
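Note that the `set(k).issubset(set(k1))` test in the updated code above ignores word order and adjacency, so a hypothetical key such as ('blue', 'large') would also count as contained in ('large', 'blue', 'dog'). If keys should only match as runs of adjacent words, a minimal sketch might look like this. The helper names `is_contiguous_subtuple` and `drop_keys_with_one_container` are illustrative assumptions, not part of the original answer.

```python
def is_contiguous_subtuple(short, long):
    """Return True if `short` appears in `long` as a run of adjacent items."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def drop_keys_with_one_container(d):
    """Keep every key except those contained in exactly one other key."""
    kept = {}
    for k, v in d.items():
        containers = [k1 for k1 in d
                      if k1 != k and is_contiguous_subtuple(k, k1)]
        if len(containers) != 1:
            kept[k] = v
    return kept


d = {('large', 'blue'): 4, ('cute', 'blue'): 3, ('large', 'blue', 'dog'): 2,
     ('cute', 'blue', 'dog'): 2, ('cute', 'blue', 'elephant'): 1}
result = drop_keys_with_one_container(d)
print(result)
```

This is still quadratic in the number of keys, like the code above; it only changes what counts as "contained".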
Answer 1 (score: 0)
This should work:
previous = " "
previousCount = 0
for words in sorted([ " ".join(key) for key in d ]) + [" "]:
    if words.startswith(previous):
        previousCount += 1
    else:
        print(previous,previousCount)
        if previousCount < 2 and previous != " ":
            del d[tuple(previous.split(" "))]
        previous = words
        previousCount = 0
Answer 2 (score: 0)
There must be a more efficient way (one that is not O(n^2)), but this seems to do what you want:
input = {
    ('large','blue'): 4,
    ('cute','blue'): 3,
    ('large','blue','dog'): 2,
    ('cute','blue','dog'): 2,
    ('cute','blue','elephant'): 1,
}
keys = set(' '.join(k) for k in input)
filtered = {
    tuple(f.split())
    for f in keys
    if sum(f != k and f in k for k in keys) == 1
}
result = {k: v for k, v in input.items() if k not in filtered}
from pprint import pprint
pprint(sorted(result.items()))
Result:
[(('cute', 'blue'), 3),
(('cute', 'blue', 'dog'), 2),
(('cute', 'blue', 'elephant'), 1),
(('large', 'blue', 'dog'), 2)]
As you requested, the idea is to identify the keys that occur exactly once as part of another key.
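One possible way to avoid comparing every key against every other key, sketched under the assumption that "part of another key" means a contiguous run of whole words: enumerate the contiguous sub-tuples of each key and look them up in the dictionary. The function name `filter_keys` is hypothetical. The work then grows with the number of keys times the square of the key length, rather than with the square of the number of keys.

```python
from collections import Counter


def filter_keys(d):
    """Drop keys that occur inside exactly one other key of `d`."""
    containers = Counter()  # key -> number of longer keys that contain it
    for key in d:
        n = len(key)
        # Distinct proper contiguous sub-tuples of this key.
        subs = {key[i:j]
                for i in range(n)
                for j in range(i + 1, n + 1)
                if j - i < n}
        for sub in subs:
            if sub in d:
                containers[sub] += 1
    # A Counter returns 0 for keys that were never counted.
    return {k: v for k, v in d.items() if containers[k] != 1}


data = {
    ('large', 'blue'): 4,
    ('cute', 'blue'): 3,
    ('large', 'blue', 'dog'): 2,
    ('cute', 'blue', 'dog'): 2,
    ('cute', 'blue', 'elephant'): 1,
}
filtered_result = filter_keys(data)
print(sorted(filtered_result.items()))
```

Because the lookup works on tuples of words, it also avoids matching a key against part of a longer word.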
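Answer 3 (score: 0)
So you want to keep the 2-tuples that occur in more than one 3-tuple? I have a solution that loops over the 3-tuples once to build a hash table, then uses it to check whether each 2-tuple occurs in more than one 3-tuple.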
Spurious occurrences are those where the words match only as substrings of letters within other words.
from collections import defaultdict, Counter
d={('large','blue'):4,('cute','blue'):3,('large','blue','dog'):2,
('cute','blue','dog'):2,('cute','blue','elephant'):1}
# partition the tuple keys by length
tab = defaultdict(list)
for key in d:
    tab[len(key)].append(key)
# make counts of the 2-tuples in the 3-tuples
fil = Counter(v for key in tab[3] for v in [key[1:],key[:-1]])
# filter the 2-tuples that don't occur in more than one 3-tuple
tab[2] = [key for key in tab[2] if fil.get(key, 0) > 1]
[(' '.join(key), d[key]) for l in tab for key in tab[l]]
The result is:
[('cute blue', 3), ('large blue dog', 2), ('cute blue elephant', 1), ('cute blue dog', 2)]
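For the string-based approaches in this thread, one common workaround for the substring problem is to pad both strings with the separator before testing containment, so that only whole words can match. A small sketch, with the hypothetical helper `contains_words` and a made-up key ('cute', 'blueberry', 'pie') used only for illustration; it assumes that no word contains the separator itself:

```python
def contains_words(short, long, sep=' '):
    """True if the words of `short` occur in `long` as a run of whole words."""
    padded_short = sep + sep.join(short) + sep
    padded_long = sep + sep.join(long) + sep
    return padded_short in padded_long


# Plain substring test: 'cute blue' is found inside 'cute blueberry pie'.
plain = ' '.join(('cute', 'blue')) in ' '.join(('cute', 'blueberry', 'pie'))

# Padded test: ' cute blue ' is not found inside ' cute blueberry pie '.
padded = contains_words(('cute', 'blue'), ('cute', 'blueberry', 'pie'))

print(plain, padded)
```

The tuple-slicing approach in this answer sidesteps the issue entirely, since it never joins the words into one string for matching.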
Answer 4 (score: 0)
Try this:
d = {('large', 'blue'): 4,
('cute', 'blue'): 3,
('large', 'blue', 'dog'): 2,
('cute', 'blue', 'dog'): 2,
('cute', 'blue', 'elephant'): 1}
final_list = [(' '.join(k), v) for k, v in sorted(d.items(), key=lambda kv: len(kv[0]))]
final = dict(final_list)
keys = [kv[0] for kv in final_list]
for idx, key in enumerate(keys):
    if sum(key in s for s in keys[idx + 1:]) == 1:
        del final[key]
print(final)
# {'cute blue': 3,
# 'large blue dog': 2,
# 'cute blue dog': 2,
# 'cute blue elephant': 1}
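final_list, final and keys are all basically sorted versions of d (sorted by length). The loop then looks for the "violating" elements of keys and deletes the corresponding keys from final.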