Question

I have a text file like this:

APPENDIX -- GLOSSARY
-------------------------------------------------------------------

  Asymmetrical Encryption:
      Encryption using a pair of keys--the first encrypts a

  Big-O Notation, Complexity:
      Big-O notation is a way of describing the governing.

      In noting complexity orders, constants and multipliers are
      conventionally omitted, leaving only the dominant factor.
      Compexities one often sees are:

      #*------------- Common Big-O Complexities ---------------#
      O(1)              constant

  Birthday Paradox:
      The name "birthday paradox" comes from the fact--surprising

  Cyclic Redundancy Check (CRC32):
      See Hash.  Based on mod 2 polynomial operations, CRC32 produces a
      32-bit "fingerprint" of a set of data.

  Idempotent Function:
      The property that applying a function to its return value
      'G=lambda x:F(F(F(...F(x)...)))'.

I want to parse the text file so that the output looks like this:

{'Asymmetrical Encryption': 'Encryption using a pair of keys--the first encrypts a',
 'Big-O Notation, Complexity': 'Big-O notation is a way of describing the governing. In noting complexity orders, constants and multipliers are conventionally omitted, leaving only the dominant factor. Compexities one often sees are: #*------------- Common Big-O Complexities ---------------# O(1)              constant', ..so on }

This is what I have done:

dic = {}
with open('appendix.txt', 'r') as f:
    data = f.read()
    lines = data.split(':\n\n')
    for line in lines:
        res = line.split(':\n      ')
        field = res[0]
        val = res[1:]

        dic[field] = val

The titles come out, but this mangles any value whose text contains a ':'. The output is incorrect.

Answer 1

If you want to parse the text based on its leading spaces, you can use a script like this:

class spaceParser(object):
    def __init__(self):
        # Per-instance state; a class-level dict would be shared by every parser.
        self.result = {}
        self.last_title = ""

    def process_content(self, content_line):
        # Append a content line to the current entry, joined by a single space.
        text = content_line.strip()
        if self.last_title and text:
            previous = self.result[self.last_title]
            self.result[self.last_title] = (previous + " " + text).strip()

    def process_title(self, content_line):
        # Drop the indentation and the trailing colon from the title.
        self.last_title = content_line.strip().rstrip(":")
        self.result[self.last_title] = ""

    def parse(self, raw_file):
        for line in raw_file:
            if not line.strip():
                # blank line: nothing to record
                continue
            # look for patterns based on indentation
            if line[0:4] == "    ":
                # content type
                self.process_content(line)
            elif line[0:2] == "  ":
                # title type
                self.process_title(line)
            else:
                # other types (file header, rule line)
                pass

parser = spaceParser()
with open('appendix.txt', 'r') as raw_file:
    parser.parse(raw_file)

print(parser.result)
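
To try the indentation rule without a file on disk, here is a minimal sketch of the same idea as a plain function fed from an in-memory sample. The names parse_glossary and sample are made up for illustration, and the sketch assumes titles are indented by exactly two spaces and content by four or more, as in the excerpt from the question. It also shows that a content line ending in ':' stays part of the value, which is where splitting on ':\n\n' goes wrong.

```python
import io


def parse_glossary(lines):
    """Map each 2-space-indented title to its 4+-space-indented body text."""
    result = {}
    title = None
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines separate paragraphs and entries
        if line.startswith("    "):
            if title is not None:
                # Collapse the body into one string, joined by single spaces.
                result[title] = (result[title] + " " + text).strip()
        elif line.startswith("  "):
            # Title line: drop the trailing colon and start a new entry.
            title = text.rstrip(":")
            result[title] = ""
        # Anything else (file header, rule line) is ignored.
    return result


# Hypothetical sample, shortened from the file shown in the question.
sample = io.StringIO(
    "APPENDIX -- GLOSSARY\n"
    "--------------------\n"
    "\n"
    "  Big-O Notation, Complexity:\n"
    "      Big-O notation is a way of describing the governing.\n"
    "\n"
    "      Compexities one often sees are:\n"
    "\n"
    "  Cyclic Redundancy Check (CRC32):\n"
    "      See Hash.  Based on mod 2 polynomial operations.\n"
)
glossary = parse_glossary(sample)
print(glossary)
```

Classifying lines by indentation rather than by the ':' character means colons inside the body text never affect where an entry starts or ends.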
