Question

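I am writing a LaTeX-to-text converter, building on a well-known Python parser (python-latex). I have been improving it day by day, but I am now stuck on parsing multiple commands on a single line. A LaTeX command can take any of the following four forms: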

\commandname
\commandname[text]
\commandname{other text}
\commandname[text]{other text}

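Assuming that a command is never split across lines, and that the text may contain spaces (but the command name may not), I ended up with the following regular expression to capture a command in a line: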

'(\\.+\[*.*\]*\{.*\})'

And indeed, a sample program works:

import re
string = r"\documentclass[this is an option]{this is a text} this is other text ..."
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', string)

>>>['', '\\documentclass[this is an option]{this is a text}', ' ', 'this', ' ', 'is', ' ', 'other', ' ', 'text', ' ...']

Well, to be honest, I would prefer output like this:

>>> [ '\\documentclass[this is an option]{this is a text}', 'this is other text ...' ]

But the first will do. My problem appears when a line contains more than one command, as in the following example:

dstring = string + r" \emph{tt}"
print(dstring)
\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', dstring)
['', '\\documentclass[this is an option]{this is a text} this is other text ... \\emph{tt}', '']

As you can see, the result is quite different from the one I would like:

[ '\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']

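I have tried lookahead and lookbehind assertions, but since they expect a fixed number of characters, I could not use them. I hope there is a solution.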

Thanks!

Answer 1

You can do this simply with github.com/alvinwan/TexSoup. It gives you what you want, although whitespace is preserved.

>>> from TexSoup import TexSoup
>>> string = r"\documentclass[this is an option]{this is a text} this is other text ..."
>>> soup = TexSoup(string)
>>> list(soup.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...']
>>> string2 = string + r"\emph{tt}"
>>> soup2 = TexSoup(string2)
>>> list(soup2.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...', \emph{tt}]

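Disclaimer: I know that (1) I am posting this more than a year later and (2) the OP asked for a regex, but assuming the task is tool-agnostic, I am leaving this here for people with similar problems. Also, I wrote TexSoup, so take this suggestion with a grain of salt.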
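For readers who do want to stay with a regex, here is a minimal sketch rather than a definitive solution. The pattern in the question uses greedy `.+` and `.*`, so one match can run from the first backslash to the last closing brace on the line. The sketch below swaps in negated character classes so that each argument stops at its own closing delimiter. It assumes that command names consist of ASCII letters only, that arguments contain no nested brackets or braces, and that no whitespace separates a command from its arguments. The names `COMMAND` and `split_commands` are made up for this illustration.

```python
import re

# Assumptions (not from the original post): command names are ASCII letters,
# and [..] / {..} arguments contain no nested brackets or braces.
COMMAND = re.compile(
    r'(\\[A-Za-z]+'      # backslash followed by the command name
    r'(?:\[[^\]]*\])?'   # optional [text], stops at the first ']'
    r'(?:\{[^}]*\})?)'   # optional {other text}, stops at the first '}'
)

def split_commands(line):
    """Split a line into LaTeX commands and the plain text between them."""
    # The single capturing group makes re.split keep the commands themselves.
    parts = COMMAND.split(line)
    # Drop empty or whitespace-only pieces and trim the surrounding spaces.
    return [part.strip() for part in parts if part.strip()]

string = r"\documentclass[this is an option]{this is a text} this is other text ..."
dstring = string + r" \emph{tt}"

print(split_commands(string))
# ['\\documentclass[this is an option]{this is a text}', 'this is other text ...']
print(split_commands(dstring))
# ['\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']
```

Nested arguments such as `\textbf{\emph{tt}}` are beyond what this pattern can handle, and that is the case where a real parser is the safer choice.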

Regexp to capture multiple LaTeX commands in one line
