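I want to extract the text inside curly braces, as in {inside}. The pieces of text differ in their prefix, for example \section{ or \subsection{, and I want to classify everything accordingly. Each piece ends at the next closing brace }.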
file = r"This is a string of an \section{example file} used for \subsection{Latex} documents."
# These are some Latex commands to be considered:
heading_1 = "\\section{"
heading_2 = "\\subsection{"
# This is my attempt.
for letter in file:
    print("The current letter: " + letter + "\n")
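I want to process LaTeX files with Python so that I can load their contents into my database.
Answer 0 (score: 0)
I think you want to use the regular expression module.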
import re
s = r"This is a string of an \section{example file} used for \subsection{Latex} documents."
pattern = re.compile(r'\\(?:sub)?section\{(.*?)\}')
re.findall(pattern, s)
#output:
['example file', 'Latex']
Answer 1 (score: 0)
If all you want is the (section-level, title) pairs for the whole file, you can use a simple regex:
import re
codewords = [
    'section',
    'subsection',
    # add others here if you want to
]
regex = re.compile(r'\\({})\{{([^}}]+)\}}'.format('|'.join(re.escape(word) for word in codewords)))
Sample usage:
In [15]: text = '''
...: \section{First section}
...:
...: \subsection{Subsection one}
...:
...: Some text
...:
...: \subsection{Subsection two}
...:
...: Other text
...:
...: \subsection{Subsection three}
...:
...: Some other text
...:
...:
...: Also some more text \texttt{other stuff}
...:
...: \section{Second section}
...:
...: \section{Third section}
...:
...: \subsection{Last subsection}
...: '''
In [16]: regex.findall(text)
Out[16]:
[('section', 'First section'),
('subsection', 'Subsection one'),
('subsection', 'Subsection two'),
('subsection', 'Subsection three'),
('section', 'Second section'),
('section', 'Third section'),
('subsection', 'Last subsection')]
By changing the contents of the codewords list you can match more kinds of commands.
To apply this to a file, first read() it:
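As a rough illustration of extending codewords, here is a minimal sketch that wraps the same pattern in a helper. The name build_regex and the extra commands 'chapter' and 'subsubsection' are my own assumptions for the example, not something from the question.

```python
import re

def build_regex(codewords):
    """Build a pattern matching \\<codeword>{title} for each given codeword."""
    alternatives = '|'.join(re.escape(word) for word in codewords)
    return re.compile(r'\\({})\{{([^}}]+)\}}'.format(alternatives))

# 'chapter' and 'subsubsection' are illustrative additions, not from the post.
regex = build_regex(['chapter', 'section', 'subsection', 'subsubsection'])

sample = r'''
\chapter{Intro}
\section{Background}
\subsubsection{Details}
\texttt{ignored}
'''

matches = regex.findall(sample)
print(matches)
```

Because the pattern requires the backslash directly before the command name and the opening brace directly after it, 'section' does not accidentally match inside '\subsection{'.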
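with open('myfile.tex') as f:
    results = regex.findall(f.read())
If you are guaranteed that each of these commands fits on a single line, you can be more memory efficient and do:
with open('myfile.tex') as f:
    results = []
    for line in f:
        results.extend(regex.findall(line))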
Or, if you want to be fancier:
from itertools import chain
with open('myfile.tex') as f:
    # list() consumes the lazy iterator while the file is still open
    results = list(chain.from_iterable(map(regex.findall, f)))
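Note, however, that if you have something like:
\section{A very
long title}
the line-by-line versions will fail, whereas the solution using read() will still pick up that section.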
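To make the difference concrete, here is a small sketch comparing the two approaches on an in-memory string instead of a file. The whitespace clean-up at the end is an assumption about what you might want, not part of the original solution.

```python
import re

regex = re.compile(r'\\(section|subsection)\{([^}]+)\}')

text = "\\section{A very\nlong title}\n\\subsection{Short}\n"

# Whole-text search: [^}]+ also matches newlines, so the wrapped title is found.
whole = regex.findall(text)

# Line-by-line search: the wrapped title is split over two lines and is missed.
per_line = [m for line in text.splitlines() for m in regex.findall(line)]

# Optional clean-up (an assumption): collapse the line break inside the title.
normalized = [(level, ' '.join(title.split())) for level, title in whole]

print(whole)
print(per_line)
print(normalized)
```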
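In any case, be aware that even the smallest change in formatting can break these solutions. For a safer alternative you will have to look for a proper LaTeX parser.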
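One such breaking case is a title that itself contains braces, such as \section{Using \textbf{bold} titles}: the [^}]+ pattern stops at the first closing brace. Below is a minimal sketch of a brace-counting scanner that copes with that one case. The helpers find_closing_brace and extract_commands are hypothetical names of mine, and this is nowhere near a real LaTeX parser (escaped braces such as \{, comments, and optional arguments are ignored).

```python
def find_closing_brace(text, start):
    """Return the index of the brace closing a group opened just before `start`, or -1."""
    depth = 1
    for index in range(start, len(text)):
        if text[index] == '{':
            depth += 1
        elif text[index] == '}':
            depth -= 1
            if depth == 0:
                return index
    return -1

def extract_commands(text, codewords=('section', 'subsection')):
    """Return (command, argument) pairs, keeping nested braces in the argument."""
    results = []
    pos = 0
    while True:
        pos = text.find('\\', pos)
        if pos == -1:
            return results
        pos += 1  # move past the backslash
        for word in codewords:
            if text.startswith(word + '{', pos):
                start = pos + len(word) + 1  # index just after the opening brace
                end = find_closing_brace(text, start)
                if end != -1:
                    results.append((word, text[start:end]))
                    pos = end + 1  # continue scanning after the argument
                break

sample = r'\section{Using \textbf{bold} titles} and \subsection{Plain}'
found = extract_commands(sample)
print(found)
```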
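If you want to group the subsections together under the section they belong to, you can do so after obtaining the results with the solution above. You will need something like itertools.groupby.
from itertools import groupby, count, chain
results = regex.findall(text)
def make_key(counter):
    def key(match):
        nonlocal counter
        val = next(counter)
        if match[0] == 'section':
            # a new section starts a new group: advance to the next id
            val = next(counter)
        # push the current id back so the following subsections reuse it
        counter = chain([val], counter)
        return val
    return key
organized_result = {}
for key, group in groupby(results, key=make_key(count())):
    # the first item of each group is the section itself
    _, section_name = next(group)
    organized_result[section_name] = section = []
    for _, subsection_name in group:
        section.append(subsection_name)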
The final result will be:
In [12]: organized_result
Out[12]:
{'First section': ['Subsection one', 'Subsection two', 'Subsection three'],
'Second section': [],
'Third section': ['Last subsection']}
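This matches the structure of the sample text shown earlier in this answer.
If you want to extend this to arbitrary entries in the codewords list, things get a bit more complicated.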
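One possible way to handle more levels, sketched under the assumption that codewords is ordered from the outermost command to the innermost, is to keep a stack of currently open headings instead of using groupby. The function build_tree, the dict layout with 'title' and 'children', and the 'subsubsection' level are all illustrative choices of mine.

```python
def build_tree(matches, codewords):
    """Nest (command, title) pairs by the order of `codewords` (outermost first)."""
    level_of = {word: depth for depth, word in enumerate(codewords)}
    root = {'title': None, 'children': []}
    stack = [(-1, root)]  # (depth, node) pairs for the currently open headings
    for command, title in matches:
        depth = level_of[command]
        # Close every open heading that is at the same depth or deeper.
        while stack[-1][0] >= depth:
            stack.pop()
        node = {'title': title, 'children': []}
        stack[-1][1]['children'].append(node)
        stack.append((depth, node))
    return root['children']

# Pairs in the same shape that regex.findall returns in the examples above.
matches = [
    ('section', 'First section'),
    ('subsection', 'Subsection one'),
    ('subsubsection', 'Detail'),
    ('subsection', 'Subsection two'),
    ('section', 'Second section'),
]
tree = build_tree(matches, ['section', 'subsection', 'subsubsection'])
print(tree)
```

A heading that skips a level (for example a subsubsection directly under a section) is simply attached to the nearest open heading above it.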