Finding the smallest list of unique n-grams that identifies a list of strings

Time: 2019-03-13 10:56:29

Tags: python algorithm

I have a list of 50K strings (city names), and I need the smallest list of character trigrams (ideally n-grams) such that every string is hit at least once by at least one trigram. Consider the following list: ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen']

The list of identifying trigrams is 4 long and should be (alternatives are possible):

['ter', 'haa', 'utr', 'gro']

I think my solution below finds the correct answer for this list, but it gives wrong answers when used on other lists.

from collections import Counter

def identifying_grams(list, n=3):

    def f7(seq):
        seen = set()
        seen_add = seen.add
        return [x for x in seq if not (x in seen or seen_add(x))]

    def ngrams(text, n=3):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    hits = []
    trigrams = []
    for item in list:
      #  trigrams += ngrams(item)
        trigrams += f7(ngrams(item))

    counts = Counter(trigrams).most_common()

    for trigram, count in counts:
        items = []
        for item in list:
            if trigram in item:
                hits.append(trigram)
                items.append(item)
        for i in items:
            list.remove(i)

    return(f7(hits))

list1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen']
print(identifying_grams(list1))
# Good, we get: ['ter', 'haa', 'utr', 'gro']

list2 = ['amsterdam','schiedam']
print(identifying_grams(list2))
# Good, we get: ['dam']

list3 = ['amsterdam','schiedam','terwolde','wolstad']
print(identifying_grams(list3))
# Ouch, we get: ['ter', 'dam', 'wol']
# this should be ['dam', 'wol'] as this is only 2 trigrams that identify the list...

So far I have two answers, but both of them have flaws. Rupesh's works for lists smaller than 10 items; my list has over 50K items. The solution from mujjiga, while not perfect, does come up with an answer.

Bounty for the Python ninja who comes up with a perfect solution that scales. Bonus kudos if it performs well and gives the same solution every time it runs!

4 Answers:

Answer 0 (score: 8):

Here is a theoretical analysis of @mujjiga's answer:

You can create classes of words that share the same ngram. You want to choose the smallest number of those classes (that is, the smallest number of ngrams) that covers the whole set of words. This is the set cover problem. Unfortunately, this problem is NP-hard (not NP-complete, thanks @mujjiga). (Edit: hence there is no known solution that will give you the expected result in a reasonable time.) The greedy algorithm is almost the best solution (see https://cs.stackexchange.com/questions/49777/is-greedy-algorithm-the-best-algorithm-for-set-cover-problem).

Note that even the greedy algorithm may give weird results. Take the sets {a, b}, {b, c}, {c, d} and the superset {a, b, c, d}. All three subsets are maximal. If you take {b, c} first, you need the two other subsets to cover the superset. If you take {a, b} and {c, d}, two subsets are enough.
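
To make that anomaly concrete, here is a minimal sketch (the subset names s1, s2, s3 and the tie_break helper are purely illustrative) showing that a greedy pass that happens to pick {b, c} first needs three subsets where two suffice:

subsets = {'s1': {'a', 'b'}, 's2': {'b', 'c'}, 's3': {'c', 'd'}}
universe = {'a', 'b', 'c', 'd'}

def greedy_cover(subsets, universe, tie_break):
    uncovered, chosen = set(universe), []
    while uncovered:
        # pick the subset covering the most uncovered elements;
        # tie_break decides among equally good candidates
        best = max(subsets, key=lambda k: (len(subsets[k] & uncovered), tie_break(k)))
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

print(greedy_cover(subsets, universe, lambda k: k == 's2'))  # ['s2', 's1', 's3']: 3 subsets
print(greedy_cover(subsets, universe, lambda k: k == 's1'))  # ['s1', 's3']: 2 subsets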

Let's use the greedy algorithm and think about the implementation. The code that builds the dictionary mapping each ngram to its words is straightforward:

all_words= ['amsterdam','schiedam','werkendam','amstelveen','schiebroek','werkstad','den haag','rotjeknor','gouda']
n=3
words_by_ngram = {}
for word in all_words:
    for ngram in (word[i:i+n] for i in range(0, len(word)-n+1)):
        words_by_ngram.setdefault(ngram, set()).add(word)

setdefault is equivalent to get if the key ngram exists, otherwise it creates an empty set. The complexity of this step is O(|all_words| * |max word length|).
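
As a quick illustration of that setdefault idiom (collections.defaultdict(set) would be an equivalent alternative):

d = {}
d.setdefault('dam', set()).add('amsterdam')  # key missing: inserts an empty set first
d.setdefault('dam', set()).add('schiedam')   # key present: behaves like d['dam']
print(d)  # {'dam': {'amsterdam', 'schiedam'}}

from collections import defaultdict
dd = defaultdict(set)  # same mapping; the empty set is created on first access
dd['dam'].add('amsterdam')
dd['dam'].add('schiedam')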

Now we want to take the ngram that hits the most words and remove those words from the dictionary. Repeat until every word is covered.

Here is the simple version:

s = set(all_words) # the target
gs = set()
d = words_by_ngram.copy() # for the display
while s:
    # take the best ngram
    ngram, words = max(d.items(), key=lambda i: len(i[1])) # pick the ngram with the most words
    # remove the words from the dictionary and delete the ngrams whose words have been already found
    d = {k:v for k, v in ((k, v - words) for k, v in d.items()) if len(v)}
    gs.add(ngram) # add the ngram to the result
    s -= words # remove the words from the target

# check
assert set().union(*[words_by_ngram[g] for g in gs]) == set(all_words)
# display
for g in gs:
    print("{} -> {}".format(g, words_by_ngram[g]))

Output:

ams -> {'amstelveen', 'amsterdam'}
gou -> {'gouda'}
wer -> {'werkstad', 'werkendam'}
rot -> {'rotjeknor'}
dam -> {'amsterdam', 'werkendam', 'schiedam'}
sch -> {'schiebroek', 'schiedam'}
den -> {'den haag'}

Because of the loop that finds the maximum of the dictionary and then updates it, the second step runs in O(|all_words| * |ngrams|). Hence the overall complexity is O(|all_words| * |ngrams|).

You can reduce the complexity using a priority queue. Retrieving the best ngram then costs O(1), but updating the priority when the len of the words mapped to an ngram changes costs O(lg |ngrams|):

import heapq
class PriorityQueue:
    """Adapted from https://docs.python.org/3/library/heapq.html#priority-queue-implementation-notes
    A priority of 1 invalidates the entries
    """
    def __init__(self, words_by_ngram):
        self._d = {ngram:[-len(words), (ngram, words)] for ngram, words in words_by_ngram.items()}
        self._pq = list(self._d.values())
        heapq.heapify(self._pq)

    def pop(self):
        """get the ngram, words tuple with the max word count"""
        minus_len, (ngram, words) = heapq.heappop(self._pq)
        while minus_len == 1: # entry is not valid
            minus_len, (ngram, words) = heapq.heappop(self._pq)
        return ngram, words

    def update(self, ngram, words_to_remove):
        """remove the words from the sets and update priorities"""
        del self._d[ngram]
        # note: n is the ngram length defined at module level
        ngrams_to_inspect = set(word[i:i+n] for word in words_to_remove
                                for i in range(0, len(word)-n+1))
        for ngram in ngrams_to_inspect:
            if ngram not in self._d: continue
            self._d[ngram][0] = 1 # use the reference to invalidate the entry
            [L, (ngram, words)] = self._d[ngram]
            words -= words_to_remove
            if words:
                self._d[ngram] = [-len(words), (ngram, words)] # new entry
                heapq.heappush(self._pq, self._d[ngram]) # add to the pq (O(lg ngrams))
            else: # nothing left: remove it from dict
                del self._d[ngram]


pq = PriorityQueue(words_by_ngram)
gs = set()
s = set(all_words) # the target
while s:
    # take the best ngram
    ngram, words = pq.pop()
    gs.add(ngram) # add the ngram to the result
    s -= words # remove the words from the target
    # remove the words from the dictionary and update priorities
    pq.update(ngram, words)

With this code, the overall complexity drops to O(|all_words| * lg |ngrams|). That said, I would be curious to know whether this is faster than the naive previous version with 50k items.
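
One way to find out is a small benchmark on synthetic data. This is only a sketch: random_words is made up here, and naive_cover and pq_cover are hypothetical wrappers that would package the two loops above as functions taking a list of words and returning the chosen ngrams:

import random
import string
import time

def random_words(count, length=9, seed=0):
    # synthetic city-like strings; real names would share far more ngrams
    rnd = random.Random(seed)
    return [''.join(rnd.choices(string.ascii_lowercase, k=length)) for _ in range(count)]

words = random_words(50000)
for cover in (naive_cover, pq_cover):  # hypothetical wrappers, not defined above
    start = time.perf_counter()
    result = cover(words)
    print(cover.__name__, len(result), '{:.2f}s'.format(time.perf_counter() - start))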

Answer 1 (score: 6):

Below is an implementation of the greedy algorithm for set cover. On my machine it runs in about half a second on 50,000 English dictionary words. The output isn't always optimal, but in practice it's often close. You could probably solve your instances to optimality with an external library for integer programming, but I don't know whether you want to go in that direction.

The code below dynamically maintains the bipartite graph of ngrams and uncovered texts. The one subtle point is that, since Python lacks an intrusive heap in its standard library, I've exploited the fact that the keys only increase to fake one. Every ngram sits in the heap with a score less than or equal to what it should be; when we pull the minimum, if it is less than it should be, we put it back with the updated value. Otherwise, we know it's the true minimum.

This code should produce deterministic output. At each step, it chooses the lexicographically smallest ngram that covers the maximum number of still-uncovered texts.

import collections
import heapq


def ngrams_from_text(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def greedy_ngram_cover(texts, n):
    neighbors_of_text = {text: ngrams_from_text(text, n) for text in texts}
    neighbors_of_ngram = collections.defaultdict(set)
    for text, ngrams in neighbors_of_text.items():
        for ngram in ngrams:
            neighbors_of_ngram[ngram].add(text)
    heap = [(-len(neighbors), ngram)
            for (ngram, neighbors) in neighbors_of_ngram.items()]
    heapq.heapify(heap)
    cover = []
    while neighbors_of_text:
        score, ngram = heapq.heappop(heap)
        neighbors = neighbors_of_ngram[ngram]
        if score != -len(neighbors):
            # stale heap entry: push it back with its updated score
            heapq.heappush(heap, (-len(neighbors), ngram))
            continue
        cover.append(ngram)
        # every text covered by this ngram is now done: detach it from
        # all the other ngrams that contain it, then drop it entirely
        for neighbor in list(neighbors):
            for neighbor_of_neighbor in neighbors_of_text[neighbor]:
                neighbors_of_ngram[neighbor_of_neighbor].remove(neighbor)
            del neighbors_of_text[neighbor]
    return cover


print(
    greedy_ngram_cover(
        ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen'], 3))
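
As a quick check of the determinism claim, repeated calls in the same process should return the identical cover, since ties on the heap are broken by the lexicographically smallest ngram; a sketch:

cities = ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen']
runs = [greedy_ngram_cover(cities, 3) for _ in range(100)]
assert all(run == runs[0] for run in runs)  # same cover every time
print(runs[0])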

Answer 2 (score: 3):

The above solution fails because:

  • Counter returns the trigrams in no guaranteed order, so if you run the solution multiple times you may also get the desired solution just by chance
  • You return a combination as soon as you find one, so you neither go in order of length nor find the best combination among all the combinations that satisfy the problem

Here I go through the trigram lists in order from shortest to longest, and return as soon as I find a solution.

from itertools import permutations

def checkTrigramsPresentInList(trigrams_list,input_list):
    for input_string in input_list:
        flag = False
        for trigram in trigrams_list:
            if trigram in input_string:
                flag = True
        if not flag:
            return False
    return True

def ngrams(text, n=3):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

def identifying_grams(input_list, n=3):
    trigrams = []
    for item in input_list:
        trigrams += ngrams(item)
    len_of_trigrams = len(trigrams)
    trigrams_unique = list(set(trigrams))
    idx =1
    correct_tri_lists = []
    unique_trigrams_list = []
    while idx <= len_of_trigrams:
        trigrams_lists = permutations(trigrams_unique,idx)

        for trigrams_list in trigrams_lists:
            # print(trigrams_list)  # debug: every candidate tried
            if trigrams_list not in unique_trigrams_list:
                unique_trigrams_list.append(trigrams_list)
                if checkTrigramsPresentInList(list(trigrams_list), input_list):
                    correct_tri_lists.append(list(trigrams_list))
                    ## Comment out the two lines below to collect all
                    ## combinations instead of returning the first one found
                    if correct_tri_lists:
                        return correct_tri_lists
        idx = idx + 1


list1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen']
print(identifying_grams(list1))
# Good, we get: ['ter', 'haa', 'utr', 'gro']

list2 = ['amsterdam','schiedam']
print(identifying_grams(list2))
# Good, we get: ['dam']

list3 = ['amsterdam','schiedam','terwolde','wolstad']
print(identifying_grams(list3))
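
Since the order of trigrams within a candidate set does not matter, itertools.combinations would shrink the search space sharply compared to permutations: C(n, k) candidates per size k instead of n!/(n-k)!, with no duplicate sets. Below is a sketch of that substitution, reusing ngrams and checkTrigramsPresentInList from above; it is still exponential, so still only feasible for small lists:

from itertools import combinations

def identifying_grams_comb(input_list, n=3):
    trigrams_unique = sorted({g for item in input_list for g in ngrams(item, n)})
    for size in range(1, len(trigrams_unique) + 1):
        for candidate in combinations(trigrams_unique, size):
            # return the first (hence smallest) set that covers every string
            if checkTrigramsPresentInList(list(candidate), input_list):
                return list(candidate)
    return None

print(identifying_grams_comb(['amsterdam', 'schiedam', 'terwolde', 'wolstad']))
# a minimal 2-trigram cover: ['dam', 'wol']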

Answer 3 (score: 2):

from nltk.util import ngrams

def load_dictonary(cities, n=3):
    ngram2cities = {}
    for city in cities:
        grams = [''.join(x) for x in ngrams(city,n)]
        for g in grams:
            if g in ngram2cities and city not in ngram2cities[g]:
                ngram2cities[g].append(city)
            else:
                ngram2cities[g] = [city]                
    return ngram2cities

def get_max(the_dict):
    n = 0
    the_max_key = None
    for key in the_dict :
        size = len(the_dict[key])
        if size > n:
            n = size
            the_max_key = key
    return the_max_key

def get_min_ngrams(cities, n=3):
    selected_ngrams = list()
    ngram2cities = load_dictonary(cities, n)
    ngram = get_max(ngram2cities)
    while ngram is not None:
        cities_covered = ngram2cities[ngram]
        ngram2cities.pop(ngram)
        selected_ngrams.append(ngram)        
        for city in cities_covered:
            # 'g' rather than 'n' here: 'n' is the ngram length parameter
            for g in ngram2cities:
                if city in ngram2cities[g]:
                    ngram2cities[g].remove(city)
        ngram = get_max(ngram2cities)
    return selected_ngrams

cities_1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen']
cities_2 = ['amsterdam','schiedam','terwolde','wolstad']
cities_3 = ['amsterdam','schiedam']
cities_4 = ['amsterdam','walwalwalwaldam']

print (get_min_ngrams(cities_1))
print (get_min_ngrams(cities_2))
print (get_min_ngrams(cities_3))
print (get_min_ngrams(cities_4))

Output:

['ter', 'utr', 'gro', 'lem']
['wol', 'dam']
['dam']
['dam']

  1. Create a dictionary with the structure {'ngram': list of cities containing that ngram}
  2. Find the ngram covered by the most cities, say x (the greedy step), remove that ngram, and add it to the solution
  3. The cities covered by the selected ngram x no longer need attention, so go through the dictionary and remove the cities covered by x
  4. Repeat from step 2 until no more ngrams are found

Why the above solution is not always optimal: as others have mentioned, the above algorithm is greedy, and this problem reduces to set cover, which has no known deterministic polynomial-time solution. So unless you want to win a million-dollar prize, searching for a polynomial-time algorithm that gives the optimal solution is futile. The next best thing is greedy. Let's see how bad greedy is compared to the optimal solution.

How bad can greedy be: if there are X cities and the best solution is c (that is, you need c ngrams to cover all X cities), then the greedy solution is no worse than c*ln(X). So if you have 50K cities, the greedy solution will be off from the optimum by a factor of at most ln(50000) ≈ 10.8197782844.
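
For reference, that factor is just the natural logarithm of the number of cities; a one-liner reproduces the figure quoted above:

import math
print(math.log(50000))  # ~10.8197782844: worst-case greedy/optimal factor for 50K cities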