Question

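Hi, I could use a hand with the following problem. I am trying to write a Python script that extracts the figures from a .tex file and puts them into another file. The input file looks like this: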

\documentclass[]....
\begin{document}

% More text

\begin{figure}    
figure_info 1
\end{figure}

\begin{figure}    
figure_info 2
\end{figure}    

%More text

The output file should look like this:

\begin{figure}    
figure_info 1
\end{figure}

\begin{figure}    
figure_info 2
\end{figure}

Thanks for your help.

Answer 1

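Thanks a lot for the answers. This is how I finally did it. It may not be the best way, but it works. I tried several of the suggested solutions, but they needed some tweaking before they worked.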

extract_block = False
# 'with' closes both files even if an error occurs while copying
with open('data.tex', 'r') as infile, open('result.tex', 'w') as outfile:
    for line in infile:
        if 'begin{figure}' in line:
            extract_block = True
        if extract_block:
            outfile.write(line)
        if 'end{figure}' in line:
            extract_block = False
            outfile.write("------------------------------------------\n\n")

Answer 2

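You can do this with the findall() function from the regular expression (re) module. Things to note:

use the re.DOTALL flag so that "." also matches newlines,
use the "lazy" operator on that dot (the question mark in ".*?"), which means the regex will not greedily run past the first \end{figure} looking for the longest possible match,
make sure your regex string is an r'raw string'; otherwise you have to escape every regex backslash as "\\" and every literal backslash in the regex as "\\\\". The same goes for a hard-coded input string.

Here we go: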

import re

TEXT = r"""\documentclass[]....
\begin{document}

% More text

\begin{figure}
figure_info 1
\end{figure}

\begin{figure}
figure_info 2
\end{figure}

%More text
"""

RE = r'(\\begin\{figure\}.*?\\end\{figure\})'

m = re.findall(RE, TEXT, re.DOTALL)

if m:
    for match in m:
        print(match)
        print()  # blank line
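
To see why the lazy ".*?" matters, here is a small sketch that runs the same pattern with and without the question mark on a shortened sample. The names SAMPLE, lazy, and greedy are made up for this illustration and are not part of the answer above.

```python
import re

# Shortened sample with some text between the two figures
SAMPLE = r"""\begin{figure}
figure_info 1
\end{figure}
text between
\begin{figure}
figure_info 2
\end{figure}
"""

# Lazy: stops at the first \end{figure} after each \begin{figure}
lazy = re.findall(r'\\begin\{figure\}.*?\\end\{figure\}', SAMPLE, re.DOTALL)

# Greedy: runs on to the last \end{figure} in the whole text
greedy = re.findall(r'\\begin\{figure\}.*\\end\{figure\}', SAMPLE, re.DOTALL)

print(len(lazy))    # one match per figure
print(len(greedy))  # a single match that swallows "text between"
```

With the greedy version everything between the first \begin{figure} and the last \end{figure} ends up in one match, which is usually not what you want here.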

Answer 3

import re

# re.DOTALL makes the . wildcard match \n newlines as well, so a match
# can span several lines; the raw string keeps the regex backslashes intact
pattern = re.compile(r'\\begin\{figure\}.*?\\end\{figure\}', re.DOTALL)

# 'with' is the preferred way of opening files; it
#    ensures they are always properly closed
with open("file1.tex") as inf, open("fileout.tex","w") as outf:
    for match in pattern.findall(inf.read()):
        outf.write(match)
        outf.write("\n\n")

Edit: found the problem. It was not in the regex but in the test text I was matching against (I forgot to escape the \b in it).
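
For anyone who hits the same thing, here is a minimal sketch of what goes wrong with an unescaped \b in a hard-coded test string. The variable names plain and raw are only for illustration.

```python
import re

plain = "\begin{figure}"   # "\b" is read as a backspace character
raw = r"\begin{figure}"    # raw string keeps the backslash and the "b"

pattern = re.compile(r'\\begin\{figure\}')

print(repr(plain))
print(repr(raw))
print(pattern.search(plain))  # no match: the backslash is gone
print(pattern.search(raw))    # matches
```

Text read from a file does not have this problem, since escape sequences are only interpreted in string literals in the source code.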

Answer 4

I would probably take the easy way and read the whole file into a string variable, like this:

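with open('/tmp/workfile', 'r') as f:
    content = f.read()

# Raw strings stop "\b" from being read as a backspace character
text = content.split(r"\begin{figure}")

text.pop(0)  # drop everything before the first figure

for a in text:
    # Keep only the part up to the closing \end{figure}
    body = a.split(r"\end{figure}")[0]
    print(r"\begin{figure}" + body + r"\end{figure}")
    print()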

You can then redirect the output to a file from the command line:

your_script.py > output_file.tex
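
If you would rather not rely on shell redirection, a rough sketch of a script that takes the input and output paths as arguments could look like the following. The names extract_figures and main and the two-argument calling convention are assumptions made for this example, not something from the question.

```python
import re
import sys

# Lazy match of one figure environment; DOTALL lets "." cross newlines
FIGURE_RE = re.compile(r'\\begin\{figure\}.*?\\end\{figure\}', re.DOTALL)


def extract_figures(text):
    """Return every figure environment found in text, in order."""
    return FIGURE_RE.findall(text)


def main(argv):
    # Assumed usage: your_script.py input.tex output.tex
    in_path, out_path = argv[1], argv[2]
    with open(in_path) as inf, open(out_path, "w") as outf:
        for block in extract_figures(inf.read()):
            outf.write(block + "\n\n")


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv)
    else:
        print("usage: your_script.py input.tex output.tex", file=sys.stderr)
```

Keeping the extraction in its own function makes it easy to try on a string before pointing it at a real file.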

Extracting data from a LaTeX file
