Fastest Python algorithm to get the most common prefix from a list of strings

Time: 2018-07-21 13:18:08

Tags: python python-3.x algorithm

I need a function:

def get_prefix(list_of_strings):
  # Should return the most common prefix
  # in the given list_of_strings,
  # with the lowest possible time complexity
  ...

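Subsequent calls should return the second most common prefix, and so on. A prefix should be discarded if it is shorter than a global variable, say min_length_of_prefix.

For example: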

['file_1', 'file_2', 'file_3', 'not_a_file_1', 'not_a_file_2']
min_length_of_prefix = 6
first call: 'not_a_file_'
second call: None

['file_1', 'file_2', 'file_3', 'not_a_file_1', 'not_a_file_2']
min_length_of_prefix = 4
first call: 'file_'
second call: 'not_a_file_'
third call: None

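2 answers:

Answer 0: (score: 2)

You can use a Trie for this.

Inserting each string takes O(n) time, where n is the length of the string. Running a DFS over the tree then finds every prefix that meets the minimum length.

Here is my implementation. It returns a list of (prefix, frequency) pairs for all prefixes that are at least min_length_of_prefix characters long, in descending order of frequency.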

class Node:
    def __init__(self, character):
        self.count = 1
        self.character = character
        self.children = {}

    def insert(self, string, idx):
        if idx >= len(string):
            return

        ch = string[idx]
        if ch in self.children:
            self.children[ch].count += 1
        else:
            self.children[ch] = Node(string[idx])

        self.children[ch].insert(string, idx+1)

class Trie:
    def __init__(self):
        self.root = Node('')

    def insert(self, string):
        self.root.insert(string, 0)

    # just a wrapper function
    def getPrefixes(self, min_length):
        # pair of prefix, and frequency
        # prefixes shorter than min_length are not stored
        self.prefixes = {}

        self._discoverPrefixes(self.root, [], min_length, 0)

        # return the prefixes in sorted order
        return sorted(self.prefixes.items(), key=lambda x: (x[1], x[0]), reverse=True)


    # do a DFS on the trie
    # discovers the prefixes in the trie and stores them in the self.prefixes dictionary
    def _discoverPrefixes(self, node, prefix, min_length, length):
        # `length` is the depth of `node`, i.e. the length of the prefix ending at it
        if length >= min_length:
            self.prefixes[''.join(prefix)+node.character] = node.count

        for ch, ch_node in node.children.items():
            prefix.append(node.character)
            self._discoverPrefixes(ch_node, prefix, min_length, length+1)
            prefix.pop()



if __name__ == '__main__':
    strings = ['file_1', 'file_2', 'file_3', 'not_a_file_1', 'not_a_file_2']

    min_length_of_prefix = 6

    trie = Trie()

    for s in strings:
        trie.insert(s)

    prefixes = trie.getPrefixes(min_length_of_prefix)

    print(prefixes)

Output:

[('not_a_file_', 2), ('not_a_file', 2), ('not_a_fil', 2), ('not_a_fi', 2), ('not_a_f', 2), ('not_a_', 2), ('not_a_file_2', 1), ('not_a_file_1', 1), ('file_3', 1), ('file_2', 1), ('file_1', 1)]
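The list above contains every qualifying prefix, while the question asks for one prefix per call and, judging by its expected output, only the longest prefix of each group. A minimal sketch of that post-processing step follows, assuming "longest" means a prefix shared by at least two strings that cannot be extended without losing one of them. The names maximal_prefixes and prefix_calls are hypothetical, and the sample pairs are a hand-written excerpt of the kind of list getPrefixes returns for a minimum length of 4.

```python
def maximal_prefixes(pairs):
    """Keep prefixes shared by 2+ strings that cannot be extended
    by one more character without losing a string."""
    shared = {p: f for p, f in pairs if f > 1}
    keep = [
        (p, f)
        for p, f in shared.items()
        # Drop p if a longer shared prefix covers exactly as many strings.
        if not any(q != p and q.startswith(p) and g == f
                   for q, g in shared.items())
    ]
    # Most frequent first; ties broken alphabetically (an arbitrary choice).
    return sorted(keep, key=lambda pf: (-pf[1], pf[0]))


def prefix_calls(pairs):
    """Yield one prefix per call to next(), most common first."""
    for prefix, _ in maximal_prefixes(pairs):
        yield prefix


pairs = [('file_', 3), ('file', 3), ('not_a_file_', 2), ('not_a_file', 2),
         ('not_a_fil', 2), ('not_a_fi', 2), ('not_a_f', 2), ('not_a_', 2),
         ('not_a', 2), ('not_', 2), ('file_1', 1)]
calls = prefix_calls(pairs)
print(next(calls, None))  # file_
print(next(calls, None))  # not_a_file_
print(next(calls, None))  # None
```

The filter compares every shared prefix with every other one, so it is quadratic in the number of prefixes. That is fine for a sketch, but the same pruning could probably be done during the DFS itself.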

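Answer 1: (score: 1)

Sort the list first so that itertools.groupby can group the strings by their first character, which serves as a prefix. For every group with more than one member, concatenate that character with every prefix returned by recursively calling the same get_prefix function on the rest of each string. The recursion also returns an empty string, so the character by itself still counts as a prefix when no longer prefix is found. At each recursion level, the number of members in the group is returned together with the prefix as a tuple, so that in the end it can be used as the sort key to make sure the more common prefixes come first.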

from itertools import groupby
from operator import itemgetter
list_of_strings = ['file_4', 'not_a_f', 'file_1', 'file_2', 'file_3', 'not_a_file_1', 'not_a_file_2']
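def get_prefix(l, m):
    if not l: return []
    # sort only on the top-level call; recursive calls pass m=None and receive already-sorted suffixes
    if m is not None: l.sort()
    # for each first-character group with more than one member, prepend the character to every prefix
    # found in the remaining suffixes; the ('', 0) sentinel makes the character itself a prefix,
    # and `f or len(g)` replaces the sentinel's 0 with the group size
    r = [(k + p, f or len(g)) for k, g in [(k, list(g)) for k, g in groupby(l, itemgetter(0))] if len(g) > 1 for p, f in get_prefix([s[1:] for s in g if len(s) > 1], None)] + [('', 0)]
    # on the top-level call, drop prefixes shorter than m and sort by frequency, descending
    if m: return sorted([(p, f) for p, f in r if len(p) >= m], key=itemgetter(1), reverse=True)
    return r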
print(get_prefix(list_of_strings, 4))
print(get_prefix(list_of_strings, 6))

This outputs:

[('file_', 4), ('file', 4), ('not_a_f', 3), ('not_a_', 3), ('not_a', 3), ('not_', 3), ('not_a_file_', 2), ('not_a_file', 2), ('not_a_fil', 2), ('not_a_fi', 2)]
[('not_a_f', 3), ('not_a_', 3), ('not_a_file_', 2), ('not_a_file', 2), ('not_a_fil', 2), ('not_a_fi', 2)]
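Because the one-line comprehension is hard to follow, here is an unrolled sketch of the same groupby recursion. It is not the answer's code: the names shared_prefixes and get_prefixes_readable are hypothetical, it sorts a copy instead of sorting the input in place, and it breaks frequency ties by putting the longer prefix first, which is an assumption about the desired order.

```python
from itertools import groupby


def shared_prefixes(strings):
    """Return (prefix, count) for every non-empty prefix shared by 2+ strings."""
    result = []
    # groupby only merges adjacent items, so work on a sorted copy;
    # empty strings have no first character and are skipped.
    ordered = sorted(s for s in strings if s)
    for ch, group in groupby(ordered, key=lambda s: s[0]):
        members = list(group)
        if len(members) < 2:
            continue  # a character seen once is not a shared prefix
        result.append((ch, len(members)))
        # Recurse on what is left of each string after its first character.
        suffixes = [s[1:] for s in members]
        result.extend((ch + p, f) for p, f in shared_prefixes(suffixes))
    return result


def get_prefixes_readable(strings, min_length):
    """Filter by length, then sort by count; ties go to the longer prefix."""
    pairs = [(p, f) for p, f in shared_prefixes(strings) if len(p) >= min_length]
    return sorted(pairs, key=lambda pf: (-pf[1], -len(pf[0]), pf[0]))


sample = ['file_4', 'not_a_f', 'file_1', 'file_2', 'file_3',
          'not_a_file_1', 'not_a_file_2']
print(get_prefixes_readable(sample, 4))
print(get_prefixes_readable(sample, 6))
```

For this sample list the two calls give the same pairs in the same order as the output above. On other inputs the order among prefixes with equal frequency may differ from the original function.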