How to parse a "here document" in Python?

Time: 2019-05-09 08:48:34

Tags: python heredoc

I want to write a Python method that reads a text file containing key-value pairs:

FOO=BAR
BUZ=BLEH

I would also like to support line breaks, both through quoted values containing \n and through here-docs:

MULTILINE1="This\nis a test"
MULTILINE2= <<DOC
This
is a test
DOC

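The first is easy to implement, but I am struggling with the second. Is there perhaps something in Python's stdlib (e.g. shlex) that I could already use for this?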

2 answers:

Answer 0 (score: 1)

Contents of "test.txt":

FOO=BAR
BUZ=BLEH
MULTILINE1="This\nis a test"
MULTILINE2= <<DOC
This
is a test
DOC

Function:

def read_strange_file(filename):
    with open(filename) as f:
        file_content = f.read().splitlines()

    res = {}
    key, value, delim = "", "", ""
    for line in file_content:
        if not delim and "=" not in line:
            continue  # skip blank or malformed lines outside a here-doc
        if "=" in line and not delim:
            # split on the first "=" only, so values may themselves contain "="
            key, value = line.split("=", 1)
            if value.strip(" ").startswith("<<"):
                delim = value.strip(" ")[2:]  # extract the delimiter keyword
                value = ""
                continue
        if not delim or (delim and line == delim):
            if value.startswith("\"") and value.endswith("\""):
                # [1:-1] drops the surrounding quotes; unicode_escape turns \n into a newline
                value = bytes(value[1:-1], "utf-8").decode("unicode_escape")
            if delim:
                value = value[:-1]  # drop the trailing "\n"
            res[key] = value
            delim = ""
        if delim:
            value += line + "\n"

    return res

Usage:

result = read_strange_file("test.txt")
print(result)
Answer 1 (score: -1)

I am assuming this is the test string (i.e., with an invisible \n character at the end of each line):

s = ''
s += 'MULTILINE1="This\nis a test"\n'
s += 'MULTILINE2= <<DOC\n'
s += 'This\n'
s += 'is a test\n'
s += 'DOC\n'

The best I could do was to cheat with NumPy:

import numpy as np

A  = np.asarray([ss.rsplit('\n', 1)  for ss in ('\n'+s).split('=')])
keys   = A[:-1,1].tolist()
values = A[1:,0].tolist()

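#optionally parse here-documents
di     = 'DOC' #delimiting identifier

def strip_heredoc(v, di):
    # str.lstrip/rstrip remove character sets, not prefixes, so slice explicitly
    v = v.strip()
    start, end = '<<%s\n' % di, '\n%s' % di
    if v.startswith(start) and v.endswith(end):
        v = v[len(start):-len(end)]
    return v

values = [strip_heredoc(v, di) for v in values]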

print('Keys: ', keys)
print('Values: ', values)

#if you want a dictionary:
d      = dict( zip(keys, values) )

The result is:

Keys:  ['MULTILINE1', 'MULTILINE2']
Values:  ['"This\nis a test"', 'This\nis a test']

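This works by sneakily prepending a \n character to the string, then splitting the whole string on = characters, and finally using rsplit to keep everything to the right of each = as the value, even when a value contains multiple \n characters. Printing the array A makes this clearer: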

[['',                             'MULTILINE1'],
 ['"This\nis a test"',            'MULTILINE2'],
 [' <<DOC\nThis\nis a test\nDOC', ''         ]]
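
The split/rsplit trick does not actually depend on NumPy. As a rough sketch, the same steps can be written with plain lists; the variable name pairs is only illustrative, and the test string is the one assumed above.

```python
s = (
    'MULTILINE1="This\nis a test"\n'
    'MULTILINE2= <<DOC\n'
    'This\n'
    'is a test\n'
    'DOC\n'
)

# Prepend "\n" so the first key also follows a newline, split on "=",
# then split each chunk at its LAST newline: [value of previous key, next key].
pairs = [chunk.rsplit("\n", 1) for chunk in ("\n" + s).split("=")]

keys = [pair[1] for pair in pairs[:-1]]
values = [pair[0] for pair in pairs[1:]]

print(pairs)
print(keys)
print(values)
```

Here pairs has the same rows as the array A printed above. Because the string is split on every = character, a value that itself contains = would be cut apart as well, so this approach only suits inputs shaped like the one shown.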