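How do I extract the content between a prefix and a suffix?

Date: 2016-08-28 16:20:52

Tags: python latex

I want to extract the text inside curly braces {inside}. What distinguishes these texts is their prefix, for example \section{ or \subsection{, which I want to use to classify everything accordingly. Each match should end at the next closing brace }.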

file = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

# These are some Latex commands to be considered:

heading_1 = "\\section{"
heading_2 = "\\subsection{"

# This is my attempt.

for letter in file:
    print("The current letter: " + letter + "\n")

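I want to process LaTeX files with Python in order to load them into my database.

2 answers:

Answer 0: (score: 0)

I think you want to use the regular expression module (re).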

import re

s = r"This is a string of an \section{example file} used for \subsection{Latex} documents."

pattern = re.compile(r'\\(?:sub)?section\{(.*?)\}')
re.findall(pattern, s)

#output:
['example file', 'Latex']
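
The pattern above discards the command name, so sections and subsections can no longer be told apart afterwards. As a minimal sketch of one way to keep it, a named group can capture the prefix as well. The group names `command` and `title` and the helper `classify` are made up for this illustration and are not part of the answer.

```python
import re

# Hypothetical variant of the pattern above: also capture the command name.
pattern = re.compile(r'\\(?P<command>(?:sub)?section)\{(?P<title>.*?)\}')

def classify(text):
    """Return (command, title) pairs in the order they appear in text."""
    return [(m.group('command'), m.group('title')) for m in pattern.finditer(text)]

s = r"This is a string of an \section{example file} used for \subsection{Latex} documents."
print(classify(s))
# [('section', 'example file'), ('subsection', 'Latex')]
```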

Answer 1: (score: 0)

If you only want (section-level, title) pairs for the whole file, you can use a simple regular expression:

import re

codewords = [
    'section',
    'subsection',
    # add other here if you want to
]

regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords)))

Sample usage:

In [15]: text = r'''
    ...: \section{First section}
    ...: 
    ...: \subsection{Subsection one}
    ...: 
    ...: Some text
    ...: 
    ...: \subsection{Subsection two}
    ...: 
    ...: Other text
    ...: 
    ...: \subsection{Subsection three}
    ...: 
    ...: Some other text
    ...: 
    ...: 
    ...: Also some more text \texttt{other stuff}
    ...: 
    ...: \section{Second section}
    ...: 
    ...: \section{Third section}
    ...: 
    ...: \subsection{Last subsection}
    ...: '''

In [16]: regex.findall(text)
Out[16]: 
[('section', 'First section'),
 ('subsection', 'Subsection one'),
 ('subsection', 'Subsection two'),
 ('subsection', 'Subsection three'),
 ('section', 'Second section'),
 ('section', 'Third section'),
 ('subsection', 'Last subsection')]

By changing the contents of the codewords list you can match more kinds of commands.
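
For instance, deeper or shallower heading levels could be picked up by extending the list. The sketch below assumes the documents use the standard `\chapter` and `\subsubsection` commands; `build_regex` is a hypothetical helper name that simply wraps the expression shown earlier.

```python
import re

def build_regex(codewords):
    """Compile the same pattern as above for an arbitrary list of command names."""
    alternatives = '|'.join(re.escape(word) for word in codewords)
    return re.compile(r'\\({})\{{([^}}]+)\}}'.format(alternatives))

# Assumed extra commands; adjust to whatever your documents actually use.
extended = build_regex(['chapter', 'section', 'subsection', 'subsubsection'])

sample = r'''
\chapter{Intro}
\section{Background}
\subsubsection{A detail}
'''
print(extended.findall(sample))
# [('chapter', 'Intro'), ('section', 'Background'), ('subsubsection', 'A detail')]
```

Because the pattern requires a backslash directly before the name and an opening brace directly after it, the order of the entries in the list should not affect which command is matched.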

To apply this to a file, call read() on it first:

with open('myfile.tex') as f:
    results = regex.findall(f.read())

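If you can guarantee that each of these commands fits on a single line, you can be more memory efficient and do:

with open('myfile.tex') as f:
    results = []
    for line in f:
        results.extend(regex.findall(line))

Or, if you want to be fancier: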

from itertools import chain

with open('myfile.tex') as f:
    # map() is lazy in Python 3: consume it while the file is still open
    results = list(chain.from_iterable(map(regex.findall, f)))

Note, however, that if you have something like:

\section{A very 
    long title}

the line-by-line approach will miss it, whereas the solution using read() will still find that section.
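
The difference can be checked on a small in-memory example. This sketch reuses the pattern from above and uses `splitlines` as a stand-in for iterating over a file object; the sample text and the whitespace clean-up step are made up for illustration.

```python
import re

regex = re.compile(r'\\(section|subsection)\{([^}]+)\}')

text = "\\section{A very\n    long title}\n\\subsection{Short}\n"

# Line by line: the title that spans two lines is never seen in one piece.
per_line = [m for line in text.splitlines(True) for m in regex.findall(line)]

# Whole text at once: [^}] also matches newlines, so the title is found.
whole = regex.findall(text)

# Optional clean-up: collapse the line break and indentation inside titles.
cleaned = [(cmd, ' '.join(title.split())) for cmd, title in whole]

print(per_line)  # [('subsection', 'Short')]
print(cleaned)   # [('section', 'A very long title'), ('subsection', 'Short')]
```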

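In any case, be aware that even the smallest change in formatting can break these solutions. For a more robust alternative you will have to look for a proper LaTeX parser.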
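
Two common cases illustrate the fragility: a title that itself contains braces, and the starred form `\section*{...}`. As a rough sketch only, and not a replacement for a real parser, one can locate the command with a regular expression and then count braces by hand. The function name `extract_headings` and the decision to accept an optional `*` are assumptions made for this illustration; escaped braces such as `\{`, optional arguments such as `\section[short]{long}`, and comments are not handled.

```python
import re

# Matches the command name, an optional star, and the opening brace.
HEADING_START = re.compile(r'\\(section|subsection)\*?\{')

def extract_headings(text):
    """Return (command, title) pairs, allowing nested braces in the title."""
    results = []
    pos = 0
    while True:
        match = HEADING_START.search(text, pos)
        if match is None:
            break
        depth = 1                 # we are just past the opening brace
        i = match.end()
        while i < len(text) and depth > 0:
            if text[i] == '{':
                depth += 1
            elif text[i] == '}':
                depth -= 1
            i += 1
        if depth != 0:
            break                 # unbalanced braces: give up on the rest
        # i now points just past the matching closing brace
        results.append((match.group(1), text[match.end():i - 1]))
        pos = i
    return results

sample = r'\section{The \textbf{bold} title} and \subsection*{Starred}'
print(extract_headings(sample))
# [('section', 'The \\textbf{bold} title'), ('subsection', 'Starred')]
```

On the same sample, the simpler `[^}]+` pattern would stop at the first closing brace and return a truncated title.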

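If you want to group the subsections "together" under their section, you can do that after obtaining the results with the solution above. You will need something like itertools.groupby:

from itertools import groupby, count, chain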

results = regex.findall(text)

def make_key(counter):
    def key(match):
        nonlocal counter
        val = next(counter)
        if match[0] == 'section':
            val = next(counter)
        counter = chain([val], counter)
        return val
    return key

organized_result = {}

for key, group in groupby(results, key=make_key(count())):
    _, section_name = next(group)
    organized_result[section_name] = section = []
    for _, subsection_name in group:
        section.append(subsection_name)

The final result will be:

In [12]: organized_result
Out[12]: 
{'First section': ['Subsection one', 'Subsection two', 'Subsection three'],
 'Second section': [],
 'Third section': ['Last subsection']}

This matches the structure of the sample text shown earlier in this post.

Extending this so that it works with the whole codewords list gets somewhat more complicated.
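
One possible way to generalize, sketched here under the assumption that the position of a command in `codewords` gives its nesting depth (0 for `section`, 1 for `subsection`, and so on), is to keep a stack of open headings instead of using `groupby`. The function `build_tree` and the `[title, children]` node layout are invented for this example.

```python
def build_tree(matches, codewords):
    """Nest (command, title) pairs into a tree of [title, children] nodes."""
    level_of = {word: depth for depth, word in enumerate(codewords)}
    root = []                      # top-level nodes
    stack = [(-1, root)]           # (level, children list) of open headings
    for command, title in matches:
        level = level_of[command]
        # Close every open heading at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        node = [title, []]
        stack[-1][1].append(node)
        stack.append((level, node[1]))
    return root

matches = [
    ('section', 'First section'),
    ('subsection', 'Subsection one'),
    ('subsubsection', 'Detail'),
    ('subsection', 'Subsection two'),
    ('section', 'Second section'),
]
tree = build_tree(matches, ['section', 'subsection', 'subsubsection'])
print(tree)
# [['First section', [['Subsection one', [['Detail', []]]], ['Subsection two', []]]],
#  ['Second section', []]]
```

The `matches` list has the same shape as the output of `regex.findall(text)`, so the two can be combined directly. A heading that skips a level, for example a subsubsection directly under a section, is simply attached to the nearest open heading above it.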