Question

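I want to extract the text {inside} curly braces. The texts differ by their prefix, such as \section{ or \subsection{, and I want to use that prefix to categorize everything accordingly. Each text should end at the next closing brace }.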

file = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

# These are some Latex commands to be considered:

heading_1 = "\\\\section{"
heading_2 = "\\\\subsection{"

# This is my attempt.

for letter in file:
    print("The current letter: " + letter + "\n")

I want to process LaTeX files with Python so that I can load them into my database.

Answer 1

I think you want to use the regular expression module.

import re

# Raw string, so that \s is not treated as an (invalid) escape sequence
s = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

pattern = re.compile(r'\\(?:sub)?section\{(.*?)\}')
re.findall(pattern, s)

#output:
['example file', 'Latex']
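
As a small illustration of why the pattern uses the non-greedy .*? rather than .*, here is a sketch comparing the two on the same string. The variable names lazy and greedy are only for this example.

```python
import re

s = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

# Non-greedy: each match stops at the first closing brace it can.
lazy = re.findall(r'\\(?:sub)?section\{(.*?)\}', s)

# Greedy: the single match runs on to the last closing brace in the string.
greedy = re.findall(r'\\(?:sub)?section\{(.*)\}', s)

print(lazy)    # ['example file', 'Latex']
print(greedy)  # ['example file} used for \\subsection{Latex']
```

With the greedy version the two headings collapse into one match, which is why the question's requirement (stop at the next closing brace) calls for .*? or a character class like [^}].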

Answer 2

If you just want the (section-level, title) pairs for the whole file, you can use a simple regex:

import re

codewords = [
    'section',
    'subsection',
    # add other here if you want to
]

regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords)))

Sample usage:

In [15]: text = r'''
    ...: \section{First section}
    ...: 
    ...: \subsection{Subsection one}
    ...: 
    ...: Some text
    ...: 
    ...: \subsection{Subsection two}
    ...: 
    ...: Other text
    ...: 
    ...: \subsection{Subsection three}
    ...: 
    ...: Some other text
    ...: 
    ...: 
    ...: Also some more text \texttt{other stuff}
    ...: 
    ...: \section{Second section}
    ...: 
    ...: \section{Third section}
    ...: 
    ...: \subsection{Last subsection}
    ...: '''

In [16]: regex.findall(text)
Out[16]: 
[('section', 'First section'),
 ('subsection', 'Subsection one'),
 ('subsection', 'Subsection two'),
 ('subsection', 'Subsection three'),
 ('section', 'Second section'),
 ('section', 'Third section'),
 ('subsection', 'Last subsection')]

By changing the contents of the codewords list, you can match more kinds of commands.

To apply this to a file, call read() on it first:
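
For instance, a sketch of the same construction wrapped in a function, with a couple of extra command names. build_regex and the sample string are hypothetical names made up for this illustration.

```python
import re

def build_regex(codewords):
    # Hypothetical helper: same pattern as above, built from any list of names.
    names = '|'.join(re.escape(word) for word in codewords)
    return re.compile(r'\\({})\{{([^}}]+)\}}'.format(names))

regex = build_regex(['chapter', 'section', 'subsection', 'subsubsection'])

sample = r'\chapter{Intro} \section{Scope} \subsubsection{Details} \textbf{bold}'
pairs = regex.findall(sample)
print(pairs)
# [('chapter', 'Intro'), ('section', 'Scope'), ('subsubsection', 'Details')]
```

Commands that are not in the list, such as \textbf here, are simply ignored.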

with open('myfile.tex') as f:
    results = regex.findall(f.read())

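If you can guarantee that each of these commands sits on a single line, you can be more memory efficient and do:

with open('myfile.tex') as f:
    results = []
    for line in f:
        results.extend(regex.findall(line))

Or if you want to be fancier: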

from itertools import chain

with open('myfile.tex') as f:
    # list() consumes the lazy iterator while the file is still open
    results = list(chain.from_iterable(map(regex.findall, f)))

But note that if you have something like:

\section{A very 
    long title}

the line-by-line versions will miss it, whereas the solution that uses read() will pick up that section as well.
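
A minimal sketch of that difference, using an in-memory string instead of a file. The names per_line, whole and cleaned are just for this example, and collapsing the whitespace at the end is an optional extra step, not part of the solution above.

```python
import re

regex = re.compile(r'\\(section|subsection)\{([^}]+)\}')

text = "\\section{A very\n    long title}\nBody text\n"

# Line by line: neither line contains a complete \section{...}.
per_line = []
for line in text.splitlines():
    per_line.extend(regex.findall(line))

# Whole text at once: [^}] also matches newlines, so the title is found.
whole = regex.findall(text)

# Optionally collapse the line break and indentation inside the title.
cleaned = [(cmd, ' '.join(title.split())) for cmd, title in whole]

print(per_line)  # []
print(whole)     # [('section', 'A very\n    long title')]
print(cleaned)   # [('section', 'A very long title')]
```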

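In any case, you have to be aware that the slightest change in formatting will break these solutions. For a more robust alternative you will have to look for a proper LaTeX parser.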
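
One example of such a change is a title with nested braces. The sketch below shows the regex cutting the title short, next to a hand-rolled brace counter that copes with this one case. extract_commands is a hypothetical helper written for this illustration; it is not a LaTeX parser and does not handle escaped braces (\{ and \}), comments or verbatim text.

```python
import re

COMMAND = re.compile(r'\\(section|subsection)\{')

def extract_commands(text):
    """Return (command, argument) pairs, counting nested braces."""
    results = []
    for match in COMMAND.finditer(text):
        depth = 1              # we are just past the opening brace
        pos = match.end()
        while pos < len(text) and depth > 0:
            if text[pos] == '{':
                depth += 1
            elif text[pos] == '}':
                depth -= 1
            pos += 1
        if depth == 0:         # skip arguments that are never closed
            results.append((match.group(1), text[match.end():pos - 1]))
    return results

sample = r'\section{The \textbf{bold} title} and \subsection{Plain}'

naive = re.findall(r'\\(section|subsection)\{([^}]+)\}', sample)
balanced = extract_commands(sample)

print(naive)     # [('section', 'The \\textbf{bold'), ('subsection', 'Plain')]
print(balanced)  # [('section', 'The \\textbf{bold} title'), ('subsection', 'Plain')]
```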

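If you want to group the subsections "together" under their section, you can do so after obtaining the results with the solution above. You will have to use something like itertools.groupby.

from itertools import groupby, count, chain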

results = regex.findall(text)

def make_key(counter):
    def key(match):
        nonlocal counter
        val = next(counter)
        if match[0] == 'section':
            val = next(counter)
        counter = chain([val], counter)
        return val
    return key

organized_result = {}

for key, group in groupby(results, key=make_key(count())):
    _, section_name = next(group)
    organized_result[section_name] = section = []
    for _, subsection_name in group:
        section.append(subsection_name)

The final result will be:

In [12]: organized_result
Out[12]: 
{'First section': ['Subsection one', 'Subsection two', 'Subsection three'],
 'Second section': [],
 'Third section': ['Last subsection']}

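This matches the structure of the sample text shown earlier.

If you want to extend this to work with the whole codewords list, things get a bit more complicated.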
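
One possible way to do it, as a rough sketch rather than an extension of the groupby code: treat the position of each name in the list as its nesting depth and keep a stack of open headings. build_tree, the levels list and the dict layout are all assumptions made for this example.

```python
def build_tree(pairs, levels):
    """Nest (command, title) pairs; levels lists commands from outermost to innermost."""
    depth_of = {name: i for i, name in enumerate(levels)}
    root = {'title': None, 'children': []}
    stack = [(-1, root)]                    # (depth, node) of currently open headings
    for command, title in pairs:
        depth = depth_of[command]
        # Close every open heading at the same depth or deeper.
        while stack[-1][0] >= depth:
            stack.pop()
        node = {'title': title, 'children': []}
        stack[-1][1]['children'].append(node)
        stack.append((depth, node))
    return root['children']

pairs = [
    ('section', 'A'),
    ('subsection', 'A1'),
    ('subsubsection', 'A1a'),
    ('subsection', 'A2'),
    ('section', 'B'),
]
tree = build_tree(pairs, ['section', 'subsection', 'subsubsection'])
print([node['title'] for node in tree])                 # ['A', 'B']
print([node['title'] for node in tree[0]['children']])  # ['A1', 'A2']
```

The pairs are in the same shape that regex.findall returns, so the output of the earlier solution can be passed in directly.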
