Creating a tree / deeply nested dict from an indented text file in Python

Asked: 2013-07-25 12:45:14

Tags: python parsing data-structures dictionary nested

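I want to iterate over a file and put the contents of each line into a deeply nested dict whose structure is defined by the amount of whitespace at the start of each line.

The aim is to take something like this: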

a
    b
        c
    d
        e

and turn it into something like this:

{"a":{"b":"c","d":"e"}}

Or this:

apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10

into this:

{"apple":{"colours":["red","yellow","green"],"type":"granny smith","price":0.10}}

so that I can send it to Python's JSON module and produce some JSON.

At the moment I'm trying to build a dict and a list step by step, like this:

  1. {"a":""} ["a"]
  2. {"a":"b"} ["a"]
  3. {"a":{"b":"c"}} ["a","b"]
  4. {"a":{"b":{"c":"d"}}} ["a","b","c"]
  5. {"a":{"b":{"c":"d"},"e":""}} ["a","e"]
  6. {"a":{"b":{"c":"d"},"e":"f"}} ["a","e"]
  7. {"a":{"b":{"c":"d"},"e":{"f":"g"}}} ["a","e","f"]
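The list acts like "breadcrumbs", showing where I last put something into the dictionary.

To do this, I need a way to walk the list and produce something like dict["a"]["e"]["f"] to get at the last dictionary. I have looked at the AutoVivification class that someone made, which looks very useful, but I'm really not sure about:

  1. Whether I'm using the right data structure (I plan to send it to the JSON library to create a JSON object)
  2. How to use AutoVivification in this instance
  3. Whether there is a better way to approach this problem.

I came up with the following function, but it doesn't work: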

      def get_nested(dict,array,i):
          if i != None:
              i += 1
              if array[i] in dict:
                  return get_nested(dict[array[i]],array)
              else:
                  return dict
          else:
              i = 0
              return get_nested(dict[array[i]],array)
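
For reference, the breadcrumb lookup described above (turning a list such as ["a","e"] into dict["a"]["e"]) can be sketched with a plain loop instead of recursion. This is only a minimal illustration: the parameter names `tree` and `path` and the sample data are assumptions, and no AutoVivification class is involved.

```python
def get_nested(tree, path):
    """Follow each key in `path` from `tree` and return what is found there."""
    node = tree
    for key in path:
        node = node[key]  # raises KeyError if a breadcrumb no longer exists
    return node


tree = {"a": {"b": {"c": "d"}, "e": {"f": "g"}}}
crumbs = ["a", "e"]

last = get_nested(tree, crumbs)  # the same object as tree["a"]["e"]
last["f"] = {"g": "h"}           # mutating it updates `tree` in place
```

Because the function returns the inner dict itself rather than a copy, writing through the returned reference is enough to extend the tree at the breadcrumb position.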
      

      Any help is much appreciated!

      (The rest of the very incomplete code is here:)

      #Import relevant libraries
      import codecs
      import sys
      
      #Functions
      def stripped(str):
          if tab_spaced:
              return str.lstrip('\t').rstrip('\n\r')
          else:
              return str.lstrip().rstrip('\n\r')
      
      def current_ws():
          if whitespacing == 0 or not tab_spaced:
              return len(line) - len(line.lstrip())
          if tab_spaced:
              return len(line) - len(line.lstrip('\t\n\r'))
      
      def get_nested(adict,anarray,i):
          if i != None:
              i += 1
              if anarray[i] in adict:
                  return get_nested(adict[anarray[i]],anarray)
              else:
                  return adict
          else:
              i = 0
              return get_nested(adict[anarray[i]],anarray)
      
      #initialise variables
      jsondict = {}
      unclosed_tags = []
      debug = []
      
      vividfilename = 'simple.vivid'
      # vividfilename = sys.argv[1]
      if len(sys.argv)>2:
          jsfilename = sys.argv[2]
      else:
          jsfilename = vividfilename.split('.')[0] + '.json'
      
      whitespacing = 0
      whitespace_array = [0,0]
      tab_spaced = False
      
      #open the file
      with codecs.open(vividfilename,'rU', "utf-8-sig") as vividfile:
          for line in vividfile:
              #work out how many whitespaces at start
              whitespace_array.append(current_ws())
      
              #For first line with whitespace, work out the whitespacing (eg tab vs 4-space)
              if whitespacing == 0 and whitespace_array[-1] > 0:
                  whitespacing = whitespace_array[-1]
                  if line[0] == '\t':
                      tab_spaced = True
      
              #strip out whitespace at start and end
              stripped_line = stripped(line)
      
              if whitespace_array[-1] == 0:
                  jsondict[stripped_line] = ""
                  unclosed_tags.append(stripped_line)
      
              if whitespace_array[-2] < whitespace_array[-1]:
                  oldnested = get_nested(jsondict,whitespace_array,None)
                  print oldnested
                  # jsondict.pop(unclosed_tags[-1])
                  # jsondict[unclosed_tags[-1]]={stripped_line:""}
                  # unclosed_tags.append(stripped_line)
      
              print jsondict
              print unclosed_tags
      
      print jsondict
      print unclosed_tags
      

4 Answers:

Answer 0 (score: 5)

Here is a recursive solution. First, transform the input in the following way.

Input:

person:
    address:
        street1: 123 Bar St
        street2: 
        city: Madison
        state: WI
        zip: 55555
    web:
        email: boo@baz.com

First-step output:

[{'name':'person','value':'','level':0},
 {'name':'address','value':'','level':1},
 {'name':'street1','value':'123 Bar St','level':2},
 {'name':'street2','value':'','level':2},
 {'name':'city','value':'Madison','level':2},
 {'name':'state','value':'WI','level':2},
 {'name':'zip','value':55555,'level':2},
 {'name':'web','value':'','level':1},
 {'name':'email','value':'boo@baz.com','level':2}]

This is easy to achieve with split(':') and by counting the number of leading tabs:

def tab_level(astr):
    """Count number of leading tabs in a string
    """
    return len(astr)- len(astr.lstrip('\t'))
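
The first step itself is not shown above, so here is one possible sketch of it. The function name `to_records` and the sample text are assumptions made for illustration; it uses `str.partition(':')` rather than `split(':')` so that a value containing a colon is not cut short, and it assumes the file is indented with tabs. Values are left as strings, so the zip code comes out as '55555' rather than the integer shown in the listing.

```python
def tab_level(astr):
    """Count number of leading tabs in a string."""
    return len(astr) - len(astr.lstrip('\t'))


def to_records(text):
    """Turn tab-indented 'name: value' lines into a flat list of records."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, value = line.strip().partition(':')
        records.append({'name': name.strip(),
                        'value': value.strip(),
                        'level': tab_level(line)})
    return records


sample = (
    "person:\n"
    "\taddress:\n"
    "\t\tstreet1: 123 Bar St\n"
    "\t\tzip: 55555\n"
    "\tweb:\n"
    "\t\temail: boo@baz.com\n"
)
records = to_records(sample)
```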

Then feed the first-step output into the following function:

def ttree_to_json(ttree,level=0):
    result = {}
    for i in range(0,len(ttree)):
        cn = ttree[i]
        try:
            nn  = ttree[i+1]
        except IndexError:
            # last record: use a sentinel level below any real one
            nn = {'level':-1}

        # Edge cases
        if cn['level']>level:
            continue
        if cn['level']<level:
            return result

        # Recursion
        if nn['level']==level:
            dict_insert_or_append(result,cn['name'],cn['value'])
        elif nn['level']>level:
            rr = ttree_to_json(ttree[i+1:], level=nn['level'])
            dict_insert_or_append(result,cn['name'],rr)
        else:
            dict_insert_or_append(result,cn['name'],cn['value'])
            return result
    return result

where:

def dict_insert_or_append(adict,key,val):
    """Insert a value in dict at key if one does not exist
    Otherwise, convert value to list and append
    """
    if key in adict:
        if type(adict[key]) != list:
            adict[key] = [adict[key]]
        adict[key].append(val)
    else:
        adict[key] = val
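
To make the behaviour of this helper concrete, here is a small usage sketch. The function is repeated so the block runs on its own (with `isinstance` in place of the `type(...) != list` comparison), and the sample keys are made up for illustration.

```python
def dict_insert_or_append(adict, key, val):
    """Insert val at key; if the key already exists, collect values in a list."""
    if key in adict:
        if not isinstance(adict[key], list):
            adict[key] = [adict[key]]
        adict[key].append(val)
    else:
        adict[key] = val


d = {}
dict_insert_or_append(d, 'colour', 'red')
dict_insert_or_append(d, 'colour', 'yellow')   # second value turns 'colour' into a list
dict_insert_or_append(d, 'type', 'granny smith')
```

A key seen once keeps its plain value, while a repeated key ends up holding a list of all its values in insertion order.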

Answer 1 (score: 2)

The following code takes a block-indented file and converts it into an XML tree; this:

foo
bar
baz
  ban
  bal

...becomes:

<cmd>foo</cmd>
<cmd>bar</cmd>
<block>
  <name>baz</name>
  <cmd>ban</cmd>
  <cmd>bal</cmd>
</block>

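The basic technique is:

  1. Set the indent to 0
  2. For each line, get its indent
  3. If > current, step down a level and save the current block/indent on a stack
  4. If == current, append to the current block
  5. If < current, pop from the stack until you find the matching indent

So: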

    from lxml import builder
    C = builder.ElementMaker()
    
    def indent(line):
        strip = line.lstrip()
        return len(line) - len(strip), strip
    
    def parse_blockcfg(data):
        top = current_block = C.config()
        stack = []
        current_indent = 0
    
        lines = data.split('\n')
        while lines:
            line = lines.pop(0)
            i, line = indent(line)
    
            if i==current_indent:
                pass
    
            elif i > current_indent:
                # we've gone down a level, convert the <cmd> to a block
                # and then save the current ident and block to the stack
                prev.tag = 'block'
                prev.append(C.name(prev.text))
                prev.text = None
                stack.insert(0, (current_indent, current_block))
                current_indent = i
                current_block = prev
    
            elif i < current_indent:
                # we've gone up one or more levels, pop the stack
                # until we find out which level and return to it
                found = False
                while stack:
                    parent_indent, parent_block = stack.pop(0)
                    if parent_indent==i:
                        found = True
                        break
                if not found:
                    raise Exception('indent not found in parent stack')
                current_indent = i
                current_block = parent_block
    
            prev = C.cmd(line)
            current_block.append(prev)
    
        return top
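
The same stack idea can also be sketched without lxml, building plain nested dicts of the kind the question asks for. This is not the code above but a simplified variant: the names `parse_indented` and `collapse` are hypothetical, and it assumes consistent indentation and unique names among siblings. Instead of matching indents exactly, it pops the stack until the top entry is less indented than the current line.

```python
def parse_indented(text):
    """Build nested dicts from indented lines; every line becomes a key."""
    root = {}
    stack = [(-1, root)]  # (indent, dict that receives deeper lines)
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        indent = len(raw) - len(raw.lstrip())
        # Pop until the top of the stack is a strict ancestor of this line.
        while stack[-1][0] >= indent:
            stack.pop()
        node = {}
        stack[-1][1][raw.strip()] = node
        stack.append((indent, node))
    return root


def collapse(node):
    """Turn leaf-only dicts into a string (one leaf) or a list (several)."""
    if not node:
        return ""
    if all(not child for child in node.values()):
        leaves = list(node)
        return leaves[0] if len(leaves) == 1 else leaves
    return {name: collapse(child) for name, child in node.items()}


text = "a\n    b\n        c\n    d\n        e\n"
tree = parse_indented(text)
result = collapse(tree)
```

Here `tree` keeps every line as a dict key, and `collapse` is an optional second pass that produces the compact shape from the question. Values stay strings, so a price such as 0.10 would need a separate conversion.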
    

Answer 2 (score: 2)

Here is an object-oriented approach based on a composite structure of nested Node objects.

Input:

indented_text = \
"""
apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10
"""

The Node class:

class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel: # add node as a child
                self.children.append(node)
            elif node.level > childlevel: # add nodes as grandchildren of the last child
                nodes.insert(0,node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level: # this node is a sibling, no more children
                nodes.insert(0,node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text

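To parse the text, first create a root node. Then remove the empty lines from the text, create a Node instance for each remaining line, and pass the resulting list to the root node's add_children method.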

root = Node('root')
root.add_children([Node(line) for line in indented_text.splitlines() if line.strip()])
d = root.as_dict()['root']
print(d)

Result:

{'apple': [
  {'colours': ['red', 'yellow', 'green']},
  {'type': 'granny smith'},
  {'price': '0.10'}]
}

I think it should be possible to do this in one step, by calling the Node constructor once with the indented text as its argument.
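
One way to get close to that single-call interface is a small wrapper function rather than a change to the constructor. The sketch below repeats the Node class so that it runs on its own; the name `parse_indented_text` is hypothetical, and returning an empty dict for empty input is an added assumption.

```python
class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel:    # direct child
                self.children.append(node)
            elif node.level > childlevel:   # grandchildren of the last child
                nodes.insert(0, node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level:  # sibling of self: stop here
                nodes.insert(0, node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text


def parse_indented_text(text):
    """Parse indented text in one call by hiding the root-node bookkeeping."""
    nodes = [Node(line) for line in text.splitlines() if line.strip()]
    if not nodes:
        return {}  # assumption: empty input yields an empty dict
    root = Node('root')
    root.add_children(nodes)
    return root.as_dict()['root']


d = parse_indented_text(
    "apple\n    colours\n        red\n        yellow\n    type\n        granny smith\n"
)
```

With several top-level lines, the wrapper returns a list instead of a dict, because the root then has more than one child.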

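Answer 3 (score: 0)

First, don't use array and dict as variable names: dict is a Python built-in type (and array is a standard-library module), and shadowing such names can cause all sorts of confusion.

OK, so if I understand you correctly, you have a tree given in a text file, with indentation indicating parentage, and you want to recover the actual tree structure. Right?

Does the following outline look right? I'm asking because I'm having trouble putting your current code into context.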

result = {}
last_indentation = 0
for l in f:  # f is the open input file
   (c, i) = parse(l) # create parse to return character and indentation
   if i==last_indentation:
       pass # sibling to last
   elif i>last_indentation:
       pass # child to last
   else:
       pass # end of children, back to a higher level

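OK, so your list holds the current parents, which is actually right - but I would have its entries point to the dictionaries you have created, not to the literal text.

Here is a start: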

result = {}
parents = {}
last_indentation = 1 # start with 1 so 0 is the root of tree
parents[0] = result
for l in f:  # f is the open input file
   (c, i) = parse(l) # create parse to return character and indentation
   if i==last_indentation:
       new_el = {}
       parents[i-1][c] = new_el
       parents[i] = new_el
   elif i>last_indentation:
       pass # child to last
   else:
       pass # end of children, back to a higher level