Question

I have some txt files in a format like this:

\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text

I would like to load this content into a dictionary that looks like this:

{'Intro': '\n text \n text \n', 
'Body': '\n text \n text', 
'Refs': '\n test \n text'}

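I'm concerned about how long it will take to process all the txt files, so I want a method that takes as little time as possible. I don't care about splitting the text into lines.

I'm trying to use a regular expression, but I'm struggling to get it to work. I think my last regex group is incorrect. Below is what I have so far. Any suggestions would be great.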

pattern = r"(====.)(.+?\b)(.*)"
matches = re.findall(pattern, data, re.DOTALL) 
my_dict = {b:c for a,b,c in matches}

Answer 1

You don't need a regex here; you can use the classic split() method instead.

Here I use textwrap to make the sample easier to read:

import textwrap

text = textwrap.dedent("""\

==== Intro 
 text 
 text 
==== Body 
 text 
 text 
==== Refs 
 test 
 text""")

You can do it like this:

result = {}
for part in text.split("==== "):
    if part.strip():  # skip empty or whitespace-only chunks
        section, content = part.split(' ', 1)
        result[section] = content

Or build the dict directly from a generator expression of key/value pairs:

result = dict(part.split(' ', 1)
              for part in text.split("==== ")
              if part.strip())
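For reference, here is the same split-based idea wrapped in a function, as a minimal sketch. The name parse_sections is made up for this illustration, and the sketch assumes every header is a single word followed by a space, as in the sample.

```python
def parse_sections(text, marker="==== "):
    """Split text on the marker and map each header word to its body."""
    result = {}
    for part in text.split(marker):
        if not part.strip():
            continue  # skip the empty or whitespace-only chunk before the first marker
        # partition never raises, even if a header has nothing after it
        section, _, content = part.partition(" ")
        result[section] = content
    return result


sample = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
sections = parse_sections(sample)
print(sections)
```

Using partition instead of split(' ', 1) is a design choice of this sketch: a header with no body then maps to an empty string instead of raising a ValueError.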

Answer 2

This should do it:

import re

d = dict(re.findall(r'(?<=\n====\s)(\w+)(\s+[^=]+)', text, re.M | re.DOTALL))
print(d)
{'Body': ' \n text \n text \n',
 'Intro': ' \n text \n text \n',
 'Refs': ' \n test \n text'}

Regex details

(?<=    # lookbehind (must be fixed width)
    \n      # newline
    ====    # four '=' chars in succession
    \s      # single wsp character
)
(       # first capture group
    \w+     # one or more word characters (letters, digits or underscore)
)    
(       # second capture group
    \s+     # one or more wsp characters
    [^=]+   # match any char that is not an '='
)
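The commented breakdown above can be compiled almost as-is with re.VERBOSE, which ignores whitespace and # comments inside the pattern. The following is a sketch only; the name SECTION_RE and the second sample string are assumptions added for illustration. It also shows a limitation of [^=]+: a body that contains an '=' is cut short at that character.

```python
import re

SECTION_RE = re.compile(
    r"""
    (?<=\n====\s)   # lookbehind: newline, four '=' characters, one whitespace character
    (\w+)           # group 1: the section name
    (\s+[^=]+)      # group 2: whitespace, then everything up to the next '='
    """,
    re.VERBOSE,
)

sample = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
sections = dict(SECTION_RE.findall(sample))
print(sections)

# Limitation: the body stops at the first '=' it meets.
truncated = dict(SECTION_RE.findall("\n==== Intro \n a = b \n"))
print(truncated)
```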

Answer 3

You can try this:

import re

s = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"

# Section names: the word that follows "\n==== "
final_data = re.findall(r"(?<=\n====\s)[a-zA-Z]+", s)
# Section bodies: assumes exactly two indented lines per section
text = re.findall(r"\n .*? \n .*?$|\n .*? \n .*? \n", s)
final_body = {a: b for a, b in zip(final_data, text)}

Output:

{'Intro': '\n text \n text \n', 'Body': '\n text \n text \n', 'Refs': '\n test \n text'}

Answer 4

If you don't want to read the whole file into memory, you can process it line by line like this:

marker = "==== "
def read_my_custom_format(file):
    current_header = None
    current_contents = []
    for line in file:
        line = line.strip()  # trim whitespace, including the trailing newline
        if line.startswith(marker):
            if current_header is not None:
                yield current_header, current_contents  # emit the finished section
            current_header = line[len(marker):]  # drop the marker
            current_contents = []
        else:
            current_contents.append(line)
    if current_header is not None:
        yield current_header, current_contents  # emit the last section

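This is a generator that yields tuples rather than building a dictionary, so it only holds one section in memory at a time. Also, each key maps to a list of lines rather than a single string, but you can easily "\n".join(iterable) them (the newlines were stripped while reading). If you want to build a single dictionary, which again takes memory proportional to the input file, you can do it like this: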

with open("your_textfile.txt") as file:
    data = dict(read_my_custom_format(file))

This works because dict() accepts an iterable of 2-tuples.
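To try the line-by-line approach without a file on disk, here is a self-contained sketch that feeds the sample through io.StringIO and joins each section's lines back into one string. The name iter_sections is hypothetical, and the sketch assumes that lines appearing before the first header can be ignored.

```python
import io

MARKER = "==== "


def iter_sections(lines):
    """Yield (header, body_lines) pairs, keeping one section in memory at a time."""
    header, body = None, []
    for line in lines:
        line = line.strip()  # drops the trailing newline and surrounding spaces
        if line.startswith(MARKER):
            if header is not None:
                yield header, body  # the previous section is complete
            header, body = line[len(MARKER):], []
        elif header is not None:
            body.append(line)
    if header is not None:
        yield header, body  # the last section has no marker after it


sample = io.StringIO(
    "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
)
sections = {header: "\n".join(body) for header, body in iter_sections(sample)}
print(sections)
```

Because each line is stripped, the values come out without the leading and trailing spaces of the original text, so this variant suits cases where the exact whitespace does not matter.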

Converting a string to a dictionary using regex groups
