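I'm having trouble writing a Python function that analyzes the sequences of strings found in a list of strings. The function takes an integer n and an ordered list of strings, and outputs a forest of trees representing the unique sequences of strings of length n (except perhaps the last sequence).
I'm not sure how to go about implementing this. Any suggestions or resources I could consult would be greatly appreciated.
Edit:
Consider the following example: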
strings = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']
Then build_forest(strings, 3) would produce a forest with the following structure:
Hello
| ___ Tim ___ Fish
Tim
| ___ Fish ___ Fish
Fish
| ___ Fish ___ Hello
| ___ Hello ___ Tim
Answer 0 (score: 1)
You can represent this with a trie, or prefix tree. Using a modified version of this answer and the rolling window iterator, you could write:
from itertools import islice
def build_trie(paths):
    head = {}
    for path in paths:
        curr = head
        for item in path:
            if item not in curr:
                curr[item] = {}
            curr = curr[item]
    return head
def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result
from pprint import pprint
pprint(build_trie(window(strings, 3)))
which prints
{'Fish': {'Fish': {'Hello': {}},
'Hello': {'Tim': {}}},
'Hello': {'Tim': {'Fish': {}}},
'Tim': {'Fish': {'Fish': {}}}}
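To get from the nested dictionary to something closer to the diagram in the question, one option is a small recursive renderer. The sketch below is only an illustration: render_forest is a hypothetical helper name, not part of either linked recipe, and the forest literal simply repeats the output shown above.

```python
def render_forest(trie, depth=0):
    """Return the nested-dict trie as a list of indented text lines."""
    lines = []
    for key, children in trie.items():
        # Indent each node by four spaces per level of depth.
        lines.append("    " * depth + key)
        lines.extend(render_forest(children, depth + 1))
    return lines

# Same structure as the printed trie above.
forest = {
    'Hello': {'Tim': {'Fish': {}}},
    'Tim': {'Fish': {'Fish': {}}},
    'Fish': {'Fish': {'Hello': {}}, 'Hello': {'Tim': {}}},
}

print("\n".join(render_forest(forest)))
```

Each root starts at the left margin and every level below it is indented one step further, so branches that share a prefix (the two under 'Fish') appear under a single parent.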
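Answer 1 (score: 1)
This is very similar to building a Markov model, except that you keep multiple branches for the possible sequences of the following n-1 strings, and you don't take probabilities into account.
Do you have anything specific in mind for how the tree should be represented?
A simple solution might look something like this: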
class TreeNode:
    def __init__(self, string):
        self.string = string
        self.children = {}
    def is_child(self, child_name):
        return child_name in self.children
    def add_child(self, child_name):
        new_child = TreeNode(child_name)
        self.children[child_name] = new_child
        return new_child
    def get_child(self, child_name):
        return self.children[child_name]
def make_tree(string_seq, n):
    trees = {}
    # The + 1 makes sure the final window of length n is included.
    for idx in range(len(string_seq) - n + 1):
        # For each possible start of a tree, check whether any tree
        # has begun with that string, and if so add to that tree;
        # otherwise, make a new one.
        tree_position = None
        if string_seq[idx] not in trees:
            tree_position = TreeNode(string_seq[idx])
            trees[string_seq[idx]] = tree_position
        else:
            tree_position = trees[string_seq[idx]]
        # Continue making new branches for any new strings that appear.
        # Offsets 1 .. n-1 cover the rest of the window.
        for offset in range(1, n):
            if not tree_position.is_child(string_seq[idx + offset]):
                tree_position.add_child(string_seq[idx + offset])
            tree_position = tree_position.get_child(string_seq[idx + offset])
    return trees
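The Markov-model comparison at the start of this answer can be made concrete. As a rough sketch (transition_counts is a hypothetical helper, not part of make_tree), a first-order Markov model would tally how often each string follows another, which is exactly the frequency information the forest throws away:

```python
from collections import Counter, defaultdict

def transition_counts(string_seq):
    """Count how often each string is immediately followed by another."""
    counts = defaultdict(Counter)
    # Pair every element with its successor.
    for current, following in zip(string_seq, string_seq[1:]):
        counts[current][following] += 1
    return counts

strings = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']
counts = transition_counts(strings)
for current, followers in counts.items():
    print(current, dict(followers))
```

Dividing each count by the total for its row would give transition probabilities; the forest keeps only the fact that a branch exists.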
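Answer 2 (score: 0)
Judging from the example data, there is another way to describe the problem.
A suitable data structure is a dictionary, which looks like: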
{
    'Hello': {
        'Tim': {
            'Fish': {}
        }
    },
    'Tim': {
        'Fish': {
            'Fish': {}
        }
    },
    'Fish': {
        'Fish': {
            'Hello': {}
        },
        'Hello': {
            'Tim': {}
        }
    },
}
Turning that into code:
example = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']
def build_forest(strings, sequence_length):
    assert sequence_length < len(strings)
    # start with an empty dictionary
    result = {}
    # iterate over all sub-sequences of the given length
    for sequence in [strings[i:i + sequence_length] for i in range(len(strings) + 1 - sequence_length)]:
        # keep track of the dictionary at the correct level we're looking at
        d = result
        # try to get all the keys of the sequence in, in order
        for key in sequence:
            # if it wasn't at the current level, add a new dictionary
            if key not in d:
                d[key] = {}
            # start looking at the next level (either new or old)
            d = d[key]
    # at the end, return the constructed dictionary
    return result
print(build_forest(example, 3))
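One way to sanity-check the result is to walk the forest back into the sequences it encodes. The following is a minimal sketch, assuming the nested-dict layout shown above; iter_paths is a hypothetical helper name introduced here for illustration.

```python
def iter_paths(forest, prefix=()):
    """Yield every root-to-leaf path in a nested-dict forest as a tuple."""
    for key, children in forest.items():
        path = prefix + (key,)
        if children:
            # Not a leaf yet: keep descending.
            yield from iter_paths(children, path)
        else:
            yield path

# The forest for the example list with a sequence length of 3.
forest = {
    'Hello': {'Tim': {'Fish': {}}},
    'Tim': {'Fish': {'Fish': {}}},
    'Fish': {'Fish': {'Hello': {}}, 'Hello': {'Tim': {}}},
}
strings = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']

paths = set(iter_paths(forest))
windows = {tuple(strings[i:i + 3]) for i in range(len(strings) - 3 + 1)}
print(paths == windows)
```

If the forest was built correctly, the set of paths should equal the set of distinct length-3 windows of the input list, with the repeated ('Hello', 'Tim', 'Fish') window counted once.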