Modeling sequences of strings with a forest of trees

Time: 2019-02-07 01:47:17

Tags: python

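I'm having trouble writing a Python function that analyzes the sequences of strings found in a list of strings. The function takes an integer n and an ordered list of strings, and outputs a forest of trees representing the unique sequences of strings of length n (except possibly the last sequence).

I'm not really sure how to go about implementing this. Any advice or resources I could refer to would be greatly appreciated.

Edit:

Consider the following example: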

strings = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']

Then build_forest(strings, 3) would produce a forest with the following structure:

Hello 
  | ___ Tim ___ Fish

Tim
  | ___ Fish ___ Fish

Fish
  | ___ Fish ___ Hello
  | ___ Hello ___ Tim
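
To make the expected result concrete, here is a small sketch (not part of the original question) that lists every run of three consecutive strings in the example. Each unique run corresponds to one root-to-leaf path in the forest drawn above. The names `windows` and `unique_windows` are chosen here purely for illustration.

```python
strings = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']
n = 3

# Every run of n consecutive strings, in order of appearance.
windows = [tuple(strings[i:i + n]) for i in range(len(strings) - n + 1)]

# Drop repeats while keeping first-seen order
# (dicts preserve insertion order in Python 3.7+).
unique_windows = list(dict.fromkeys(windows))

for w in unique_windows:
    print(' -> '.join(w))
```

The list has five windows, but the last one repeats the first, so only four distinct paths end up in the forest.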

3 Answers:

Answer 0 (score: 1)

You can use a trie, or prefix tree, to represent this. Using a modified version of this answer together with the rolling window iterator, you could say:

from itertools import islice

def build_trie(paths):
    head = {}
    for path in paths:
        curr = head
        for item in path:
            if item not in curr:
                curr[item] = {}
            curr = curr[item]
    return head

def window(seq, n=2):
    """Return a sliding window (of width n) over data from the iterable.

    s -> (s0, s1, ..., s[n-1]), (s1, s2, ..., sn), ...
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

from pprint import pprint

pprint(build_trie(window(strings, 3)))

This prints:

{'Fish': {'Fish': {'Hello': {}}, 
          'Hello': {'Tim': {}}},
 'Hello': {'Tim': {'Fish': {}}},
 'Tim': {'Fish': {'Fish': {}}}}
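
If an indented text view closer to the diagram in the question is wanted, the nested dictionary can be walked recursively. The following is only a sketch: `render_forest` is a hypothetical helper, not part of the answer, and the trie literal is copied from the output above so the block runs on its own.

```python
def render_forest(trie, indent=0):
    """Return one line per node, with children indented under their parent."""
    lines = []
    for key, children in trie.items():
        lines.append('  ' * indent + key)
        lines.extend(render_forest(children, indent + 1))
    return lines


# Same structure as the pprint output above.
trie = {'Fish': {'Fish': {'Hello': {}}, 'Hello': {'Tim': {}}},
        'Hello': {'Tim': {'Fish': {}}},
        'Tim': {'Fish': {'Fish': {}}}}

print('\n'.join(render_forest(trie)))
```

Each top-level key becomes the root of one tree, and every deeper level is indented by two more spaces.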

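Answer 1 (score: 1)

This is very similar to building a Markov model, except that you have multiple branches for the next n-1 possible items in a sequence, and you don't take probabilities into account.

Do you have anything specific in mind for how the trees should be represented?

A simple solution might involve something like the following: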

class TreeNode:
    def __init__(self, string):
        self.string = string
        self.children = {}

    def is_child(self, child_name):
        return child_name in self.children

    def add_child(self, child_name):
        new_child = TreeNode(child_name)
        self.children[child_name] = new_child
        return new_child

    def get_child(self, child_name):
        return self.children[child_name]


def make_tree(string_seq, n):
    trees = {}
    # There are len(string_seq) - n + 1 windows of length n.
    for idx in range(len(string_seq) - n + 1):
        # For each possible start of a tree, check whether a tree already
        # begins with that string; if so, add to it, otherwise make a new one.
        if string_seq[idx] not in trees:
            tree_position = TreeNode(string_seq[idx])
            trees[string_seq[idx]] = tree_position
        else:
            tree_position = trees[string_seq[idx]]

        # Continue making new branches for any new strings that appear
        # among the remaining n - 1 items of the window.
        for offset in range(1, n):
            if not tree_position.is_child(string_seq[idx + offset]):
                tree_position.add_child(string_seq[idx + offset])
            tree_position = tree_position.get_child(string_seq[idx + offset])
    return trees
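
To compare a node-based forest with the dictionary-based answers, the nodes can be flattened into nested dictionaries. This sketch assumes only that each node has a `string` attribute and a `children` dict, as in the class above. `to_dict` is a hypothetical helper, and the `TreeNode` here is a minimal stand-in so the block runs on its own.

```python
class TreeNode:
    """Minimal stand-in with the same attributes as the class in this answer."""

    def __init__(self, string):
        self.string = string
        self.children = {}


def to_dict(node):
    """Recursively convert a node's children into nested dictionaries."""
    return {name: to_dict(child) for name, child in node.children.items()}


# Build the single path Hello -> Tim -> Fish by hand.
root = TreeNode('Hello')
tim = TreeNode('Tim')
root.children['Tim'] = tim
tim.children['Fish'] = TreeNode('Fish')

print({root.string: to_dict(root)})
```

Applied to every root in the `trees` mapping, the same conversion would give a structure that can be compared directly with the nested dictionaries in the other answers.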

Answer 2 (score: 0)

From the example data, another way to describe the problem is:

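  • given a sequence of n strings,
  • for all subsequences of length m (m < n),
  • generate a tree data structure that stores these subsequences efficiently,
  • so that the first element of a subsequence is at the top level,
  • the second element is on the first level below it, and so on,
  • with no node repeated under a particular parent node.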

A suitable data structure is a dictionary, which would look like this:

{
    'Hello': {
        'Tim': {
            'Fish': {}
        }
    },
    'Tim': {
        'Fish': {
            'Fish': {}
        }
    },
    'Fish': {
        'Fish': {
            'Hello': {}
        },
        'Hello': {
            'Tim': {}
        }
    }
}

Turning that into code:

example = ['Hello', 'Tim', 'Fish', 'Fish', 'Hello', 'Tim', 'Fish']


def build_forest(strings, sequence_length):
    assert sequence_length < len(strings)
    # start with an empty dictionary
    result = {}
    # iterate over all sub-sequences of the given length
    for sequence in [strings[i:i + sequence_length] for i in range(len(strings) + 1 - sequence_length)]:
        # keep track of the dictionary at the correct level we're looking at
        d = result
        # try to get all the keys of the sequence in, in order
        for key in sequence:
            # if it wasn't at the current level, add a new dictionary
            if key not in d:
                d[key] = {}
            # start looking at the next level (either new or old)
            d = d[key]
    # at the end, return the constructed dictionary
    return result


print(build_forest(example, 3))
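
As a follow-up, the stored sequences can be read back out of such a dictionary by walking every root-to-leaf path. This is a sketch under the assumption that leaves are empty dictionaries, as in the structure shown earlier. `iter_paths` is a hypothetical helper, and the forest literal is written out by hand so the block runs on its own.

```python
def iter_paths(forest, prefix=()):
    """Yield each root-to-leaf path in a nested-dict forest as a tuple."""
    for key, children in forest.items():
        path = prefix + (key,)
        if children:
            # Not a leaf yet: keep descending.
            yield from iter_paths(children, path)
        else:
            # Empty dict marks the end of a stored sequence.
            yield path


forest = {
    'Hello': {'Tim': {'Fish': {}}},
    'Tim': {'Fish': {'Fish': {}}},
    'Fish': {'Fish': {'Hello': {}}, 'Hello': {'Tim': {}}},
}

for path in iter_paths(forest):
    print(path)
```

For the example data this yields the four distinct length-3 sequences, which is a quick way to check that nothing was lost or duplicated when the forest was built.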