I have a 50K list of strings (city names) and I need the smallest list of character trigrams (possibly n-grams) such that every string is hit at least once by at least one trigram. Consider the following list: ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen']
A list of identifying trigrams is 4 long and should be (alternatives possible):
['ter', 'haa', 'utr', 'gro']
I thought my solution below finds the correct answer, but when used on other lists it returns wrong answers:
from collections import Counter

def identifying_grams(list, n=3):
    def f7(seq):
        seen = set()
        seen_add = seen.add
        return [x for x in seq if not (x in seen or seen_add(x))]

    def ngrams(text, n=3):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    hits = []
    trigrams = []
    for item in list:
        # trigrams += ngrams(item)
        trigrams += f7(ngrams(item))
    counts = Counter(trigrams).most_common()
    for trigram, count in counts:
        items = []
        for item in list:
            if trigram in item:
                hits.append(trigram)
                items.append(item)
        for i in items:
            list.remove(i)
    return f7(hits)
list1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen']
print(identifying_grams(list1))
# Good, we get: ['ter', 'haa', 'utr', 'gro']
list2 = ['amsterdam','schiedam']
print(identifying_grams(list2))
# Good, we get: ['dam']
list3 = ['amsterdam','schiedam','terwolde','wolstad']
print(identifying_grams(list3))
# Ouch, we get: ['ter', 'dam', 'wol']
# this should be ['dam', 'wol'] as this is only 2 trigrams that identify the list...
Two answers so far, but both have flaws. The one from Rupesh works for lists of fewer than about 10 items; my lists have over 50k items. The one from mujjiga does come up with a solution, albeit not the perfect one.
A bounty for the Python ninja who comes up with a perfect solution that scales. Bonus kudos if it performs well and gives the same solution every time it runs!
Answer 0 (score: 8)
Here's a theoretical analysis of @mujjiga's answer:
You can create classes of words that share the same ngram. You want to pick the smallest number of those classes (that is, the smallest number of ngrams) that covers the whole set of words. This is the set cover problem. Unfortunately, this problem is NP-hard (not NP-complete, thanks @mujjiga). (EDIT: hence there is no known solution that will give you the expected result in a reasonable time.) The greedy algorithm is almost the best solution (see https://cs.stackexchange.com/questions/49777/is-greedy-algorithm-the-best-algorithm-for-set-cover-problem).
Note that even the greedy algorithm can give weird results. Take the sets {a, b}, {b, c}, {c, d} and the superset {a, b, c, d}. All three subsets have maximal size. If you take {b, c} first, you need the two other subsets to cover the superset. If you take {a, b} or {c, d} first, two subsets are enough.
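Here's a minimal sketch of that pathology (my own illustration, not part of the answer's code; ties are broken by dict insertion order):

# Greedy set cover on the example above; ties broken by insertion order.
subsets = {'bc': {'b', 'c'}, 'ab': {'a', 'b'}, 'cd': {'c', 'd'}}
uncovered = {'a', 'b', 'c', 'd'}  # the superset to cover
chosen = []
while uncovered:
    # pick the subset covering the most still-uncovered elements
    name = max(subsets, key=lambda k: len(subsets[k] & uncovered))
    chosen.append(name)
    uncovered -= subsets[name]
print(chosen)  # ['bc', 'ab', 'cd']: three subsets, while ['ab', 'cd'] suffices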
Let's use the greedy algorithm and look at the implementation. The code that creates the dictionary mapping each ngram to its words is straightforward:
all_words = ['amsterdam','schiedam','werkendam','amstelveen','schiebroek','werkstad','den haag','rotjeknor','gouda']
n = 3
words_by_ngram = {}
for word in all_words:
    for ngram in (word[i:i+n] for i in range(0, len(word)-n+1)):
        words_by_ngram.setdefault(ngram, set()).add(word)
setdefault is equivalent to get if the key ngram already exists, and creates an empty set otherwise. The complexity of this step is O(|all_words|*|len max word|).
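For reference, a minimal equivalent using collections.defaultdict (my rewording, not part of the answer):

from collections import defaultdict

words_by_ngram = defaultdict(set)  # missing keys start out as empty sets
for word in all_words:
    for i in range(len(word) - n + 1):
        words_by_ngram[word[i:i+n]].add(word)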
Now we want to take the ngram that covers the most words and remove those words from the dictionary, repeating until all target words are covered.
Here's the naive version:
s = set(all_words)  # the target
gs = set()
d = words_by_ngram.copy()  # for the display
while s:
    # take the best ngram
    ngram, words = max(d.items(), key=lambda i: len(i[1]))  # sort on word count
    # remove the words from the dictionary and delete the ngrams whose words have all been found
    d = {k: v for k, v in ((k, v - words) for k, v in d.items()) if len(v)}
    gs.add(ngram)  # add the ngram to the result
    s -= words  # remove the words from the target

# check
assert set().union(*[words_by_ngram[g] for g in gs]) == set(all_words)

# display
for g in gs:
    print("{} -> {}".format(g, words_by_ngram[g]))
Output:
ams -> {'amstelveen', 'amsterdam'}
gou -> {'gouda'}
wer -> {'werkstad', 'werkendam'}
rot -> {'rotjeknor'}
dam -> {'amsterdam', 'werkendam', 'schiedam'}
sch -> {'schiebroek', 'schiedam'}
den -> {'den haag'}
This second step has O(|all_words|*|ngrams|) complexity because of the loop to find the max and the updates to the dictionary, so the overall complexity is O(|all_words|*|ngrams|).
It is possible to reduce the complexity using a priority queue. Retrieving the best ngram then costs O(1), but updating the priority of an ngram whose word set has shrunk costs O(lg |ngrams|):
import heapq

class PriorityQueue:
    """Adapted from https://docs.python.org/3/library/heapq.html#priority-queue-implementation-notes
    A priority of 1 invalidates the entries
    """
    def __init__(self, words_by_ngram):
        self._d = {ngram: [-len(words), (ngram, words)]
                   for ngram, words in words_by_ngram.items()}
        self._pq = list(self._d.values())
        heapq.heapify(self._pq)

    def pop(self):
        """get the (ngram, words) tuple with the max word count"""
        minus_len, (ngram, words) = heapq.heappop(self._pq)
        while minus_len == 1:  # entry is not valid
            minus_len, (ngram, words) = heapq.heappop(self._pq)
        return ngram, words

    def update(self, ngram, words_to_remove):
        """remove the words from the sets and update priorities"""
        del self._d[ngram]
        ngrams_to_inspect = set(word[i:i+n] for word in words_to_remove
                                for i in range(0, len(word)-n+1))
        for ngram in ngrams_to_inspect:
            if ngram not in self._d:
                continue
            self._d[ngram][0] = 1  # use the reference to invalidate the entry
            [L, (ngram, words)] = self._d[ngram]
            words -= words_to_remove
            if words:
                self._d[ngram] = [-len(words), (ngram, words)]  # new entry
                heapq.heappush(self._pq, self._d[ngram])  # add to the pq (O(lg ngrams))
            else:  # nothing left: remove it from the dict
                del self._d[ngram]
pq = PriorityQueue(words_by_ngram)
gs = set()
s = set(all_words)  # the target
while s:
    # take the best ngram
    ngram, words = pq.pop()
    gs.add(ngram)  # add the ngram to the result
    s -= words  # remove the words from the target
    # remove the words from the dictionary and update priorities
    pq.update(ngram, words)
With this code, the overall complexity drops to O(|all_words| * lg |ngrams|). That being said, I'd be curious to know whether this is actually faster than the naive previous version on 50k items.
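One way to find out is a small timing harness (a sketch; greedy_naive and greedy_pq are hypothetical wrappers around the two versions above, and all_words is assumed to hold the 50k-item list):

import timeit

# compare the two greedy implementations on the same data
for impl in ('greedy_naive', 'greedy_pq'):
    t = timeit.timeit('%s(all_words, 3)' % impl, globals=globals(), number=5)
    print('%s: %.3fs per run' % (impl, t / 5))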
Answer 1 (score: 6)
Below is an implementation of the greedy algorithm for set cover. It runs in about half a second on 50,000 English dictionary words on my machine. The output isn't always optimal, but it's often close in practice. You could probably solve your instances to optimality with an external library for integer programming, but I don't know if you want to go in that direction.
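For completeness, here's a sketch of that integer-programming route using PuLP (my assumption, not part of this answer; it needs pip install pulp and uses the bundled CBC solver). It minimizes the number of selected ngrams subject to every text being covered:

import pulp

def exact_ngram_cover(texts, n=3):
    ngrams_of = {t: {t[i:i + n] for i in range(len(t) - n + 1)} for t in texts}
    all_ngrams = sorted(set().union(*ngrams_of.values()))
    prob = pulp.LpProblem('ngram_cover', pulp.LpMinimize)
    use = pulp.LpVariable.dicts('use', all_ngrams, cat='Binary')  # 1 = ngram selected
    prob += pulp.lpSum(use.values())  # minimize the number of selected ngrams
    for text, grams in ngrams_of.items():
        # every text must be covered by at least one selected ngram
        prob += pulp.lpSum(use[g] for g in grams) >= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [g for g in all_ngrams if use[g].value() >= 0.5]

print(exact_ngram_cover(['amsterdam', 'schiedam', 'terwolde', 'wolstad']))
# expected: a 2-element cover such as ['dam', 'wol']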
The code below dynamically maintains the bipartite graph of ngrams and uncovered texts. The one subtle point is that, since Python lacks an intrusive heap in its standard library, I've exploited the fact that the keys only increase to fake one. Every ngram is in the heap with a score less than or equal to what it should be. When we pull the minimum, if it is less than it should be, we put it back with the updated value. Otherwise, we know it is the true minimum.
This code should produce deterministic output. At each step it chooses the lexicographically minimum ngram that covers the maximum number of uncovered texts.
import collections
import heapq

def ngrams_from_text(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def greedy_ngram_cover(texts, n):
    # bipartite graph: each text knows its ngrams, each ngram knows its texts
    neighbors_of_text = {text: ngrams_from_text(text, n) for text in texts}
    neighbors_of_ngram = collections.defaultdict(set)
    for text, ngrams in neighbors_of_text.items():
        for ngram in ngrams:
            neighbors_of_ngram[ngram].add(text)
    heap = [(-len(neighbors), ngram)
            for (ngram, neighbors) in neighbors_of_ngram.items()]
    heapq.heapify(heap)
    cover = []
    while neighbors_of_text:
        score, ngram = heapq.heappop(heap)
        neighbors = neighbors_of_ngram[ngram]
        if score != -len(neighbors):
            # stale entry: push it back with the true score
            heapq.heappush(heap, (-len(neighbors), ngram))
            continue
        cover.append(ngram)
        for neighbor in list(neighbors):
            for neighbor_of_neighbor in neighbors_of_text[neighbor]:
                neighbors_of_ngram[neighbor_of_neighbor].remove(neighbor)
            del neighbors_of_text[neighbor]
    return cover

print(
    greedy_ngram_cover(
        ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen'], 3))
Answer 2 (score: 3)
The solution above fails because Counter returns trigrams in no particular order, so running it multiple times yields the desired solution only by chance. Here I enumerate candidate trigram lists in order from shortest to longest and return as soon as a solution is found:
from itertools import permutations

def checkTrigramsPresentInList(trigrams_list, input_list):
    for input_string in input_list:
        flag = False
        for trigram in trigrams_list:
            if trigram in input_string:
                flag = True
        if not flag:
            return False
    return True

def ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def identifying_grams(input_list, n=3):
    trigrams = []
    for item in input_list:
        trigrams += ngrams(item)
    len_of_trigrams = len(trigrams)
    trigrams_unique = list(set(trigrams))
    idx = 1
    correct_tri_lists = []
    unique_trigrams_list = []
    while idx <= len_of_trigrams:
        trigrams_lists = permutations(trigrams_unique, idx)
        for trigrams_list in trigrams_lists:
            print(trigrams_list)
            if not trigrams_list in unique_trigrams_list:
                if checkTrigramsPresentInList(list(trigrams_list), input_list):
                    correct_tri_lists.append(list(trigrams_list))
            ## Uncomment below lines if only one combination is needed
            if correct_tri_lists:
                return correct_tri_lists
            unique_trigrams_list.append(trigrams_list)
        idx = idx + 1
list1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen']
print(identifying_grams(list1))
# Good, we get: ['ter', 'haa', 'utr', 'gro']
list2 = ['amsterdam','schiedam']
print(identifying_grams(list2))
# Good, we get: ['dam']
list3 = ['amsterdam','schiedam','terwolde','wolstad']
print(identifying_grams(list3))
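A small tweak I would suggest (my own sketch, not part of this answer): the order of trigrams within a candidate cover doesn't matter, so itertools.combinations enumerates each candidate once instead of the k! times that permutations does, shrinking the search space considerably:

from itertools import combinations

# drop-in variant of identifying_grams, reusing ngrams and
# checkTrigramsPresentInList from the answer above
def identifying_grams_by_combinations(input_list, n=3):
    trigrams_unique = list({g for item in input_list for g in ngrams(item, n)})
    for size in range(1, len(trigrams_unique) + 1):
        for cand in combinations(trigrams_unique, size):
            if checkTrigramsPresentInList(list(cand), input_list):
                return list(cand)  # first (smallest) cover found

print(identifying_grams_by_combinations(['amsterdam', 'schiedam', 'terwolde', 'wolstad']))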
Answer 3 (score: 2)
from nltk.util import ngrams

def load_dictonary(cities, n=3):
    ngram2cities = {}
    for city in cities:
        grams = [''.join(x) for x in ngrams(city, n)]
        for g in grams:
            # create the list on first sight; append each city at most once
            if g not in ngram2cities:
                ngram2cities[g] = [city]
            elif city not in ngram2cities[g]:
                ngram2cities[g].append(city)
    return ngram2cities

def get_max(the_dict):
    max_size = 0
    the_max_key = None
    for key in the_dict:
        size = len(the_dict[key])
        if size > max_size:
            max_size = size
            the_max_key = key
    return the_max_key

def get_min_ngrams(cities, n=3):
    selected_ngrams = list()
    ngram2cities = load_dictonary(cities, n)
    ngram = get_max(ngram2cities)
    while ngram is not None:
        cities_covered = ngram2cities[ngram]
        ngram2cities.pop(ngram)
        selected_ngrams.append(ngram)
        for city in cities_covered:
            for g in ngram2cities:
                if city in ngram2cities[g]:
                    ngram2cities[g].remove(city)
        ngram = get_max(ngram2cities)
    return selected_ngrams

cities_1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen']
cities_2 = ['amsterdam','schiedam','terwolde','wolstad']
cities_3 = ['amsterdam','schiedam']
cities_4 = ['amsterdam','walwalwalwaldam']
print(get_min_ngrams(cities_1))
print(get_min_ngrams(cities_2))
print(get_min_ngrams(cities_3))
print(get_min_ngrams(cities_4))
Output:
['ter', 'utr', 'gro', 'lem']
['wol', 'dam']
['dam']
['dam']
Why the above solution is not always optimal: as others have mentioned, the algorithm above is greedy, and this problem reduces to set cover, which has no known deterministic polynomial-time solution. So unless you're out to win a million-dollar prize, searching for a polynomial-time algorithm that yields the optimal solution is futile; the next best thing is greedy. Let's see how bad greedy is compared to the optimal solution.
How bad is greedy: if there are X cities and the best solution is c (i.e. you need c ngrams to cover all X cities), then the greedy solution is no worse than c*ln(X). So with 50K cities, the greedy solution is, in the worst case, off from the optimum by a factor of at most ln(50000) ≈ 10.8197782844.
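A quick sanity check of that constant (my sketch; math.log is the natural log):

import math

# worst-case approximation factor of greedy set cover on 50,000 items
print(math.log(50000))  # ≈ 10.8197782844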