Question

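I'm trying to use a regex in Python to extract the cited BibTeX keys from a LaTeX document.

I want to exclude a citation if it is commented out (preceded by %), but still include it if it is preceded by an escaped percent sign (\%).

This is what I have come up with so far: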

\\(?:no|)cite\w*\{(.*?)\}

A sample to try it on:

blablabla
Author et. al \cite{author92} bla bla. % should match
\citep{author93} % should match
\nocite{author94} % should match
100\%\nocite{author95} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\cite{author97, author98, author99} % should match
\nocite{*} % should not match
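
To make the problem concrete, here is a minimal sketch of how the attempt above behaves on this sample with Python's built-in re module. The names attempt, sample and found are only illustrative.

```python
import re

# The pattern from the question, unchanged.
attempt = re.compile(r'\\(?:no|)cite\w*\{(.*?)\}')

sample = r"""
blablabla
Author et. al \cite{author92} bla bla. % should match
\citep{author93} % should match
\nocite{author94} % should match
100\%\nocite{author95} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\cite{author97, author98, author99} % should match
\nocite{*} % should not match
"""

# findall returns the content of the single capture group for every match.
found = attempt.findall(sample)
print(found)
```

The result still contains author96 (from the commented-out line) and * (from \nocite{*}), which are the two cases I want to exclude.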

Regex101 test: https://regex101.com/r/ZaI8kG/2/

Thanks for your help.

Answer 1

With the newer regex module (pip install regex), you can use the following expression:

(?<!\\)%.+(*SKIP)(*FAIL)|\\(?:no)?citep?\{(?P<author>(?!\*)[^{}]+)\}

See a demo on regex101.com.

In more detail:

(?<!\\)%.+(*SKIP)(*FAIL)     # an unescaped % (not preceded by \) and the rest
                             # of the line: skip it and fail the match there
|                            # or
\\(?:no)?citep?              # \nocite, \cite or \citep
\{                           # { literally
    (?P<author>(?!\*)[^{}]+) # must not start with a star
\}                           # } literally

If you cannot install additional libraries, you need to change the expression to

(?<!\\)%.+
|
(\\(?:no)?citep?
\{
    ((?!\*)[^{}]+)
\})

and check programmatically whether the second capture group has been set (i.e. is not empty).
In Python, the latter could look like this:

import re

latex = r"""
blablabla
Author et. al \cite{author92} bla bla. % should match
\citep{author93} % should match
\nocite{author94} % should match
100\%\nocite{author95} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\cite{author97, author98, author99} % should match
\nocite{*} % should not match
"""

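# Alternative 1 consumes a comment (an unescaped % up to the end of the line) without capturing.
# Alternative 2 captures the whole command in group 1 and its keys in group 2.
rx = re.compile(r'''(?<!\\)%.+|(\\(?:no)?citep?\{((?!\*)[^{}]+)\})''')

# Keep only the matches where group 2 is set, i.e. citations outside of comments.
authors = [m.group(2) for m in rx.finditer(latex) if m.group(2)]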
print(authors)

which yields

['author92', 'author93', 'author94', 'author95', 'author95', 'author97, author98, author99']
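
Note that the last entry is still a single comma-separated string. If you want individual keys, one possible follow-up step is to split each captured group on commas and drop duplicates. This is only a minimal sketch, assuming the keys themselves contain no commas. The names extract_keys and sample are illustrative, and the pattern is the same fallback expression with the outer capture group removed.

```python
import re

# Same idea as above: comments are matched first and capture nothing,
# citations capture their (possibly comma-separated) keys in group 1.
rx = re.compile(r'(?<!\\)%.+|\\(?:no)?citep?\{((?!\*)[^{}]+)\}')

def extract_keys(latex):
    """Return the cited keys in order of first appearance, without duplicates."""
    keys = []
    for m in rx.finditer(latex):
        group = m.group(1)
        if not group:          # the comment alternative matched, nothing captured
            continue
        for key in group.split(','):
            key = key.strip()  # "author97, author98" carries spaces after commas
            if key and key not in keys:
                keys.append(key)
    return keys

sample = r"""
\cite{author97, author98, author99} % should match
100\% \nocite{author95}
%\nocite{author96}
\citep{author97}
"""

print(extract_keys(sample))
```

For this sample the function returns author97, author98, author99 and author95, each once. The commented-out author96 is skipped, and the repeated author97 is not added again.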

Answer 2

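I'm not sure I follow the logic of the last case. It seems to me that * may simply not be wanted inside {}, in which case you might want to design an expression along these lines: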

^(?!(%\\(?:no)?cite\w*\{([^}]*?)\}))[^*\n]*$

Not sure though.
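
The expression above matches whole lines rather than capturing keys, so using it would need a second pass over the lines it accepts. A rough sketch of that idea follows. Here key_rx is an assumed helper pattern that is not part of the expression above, and re.MULTILINE is assumed so that ^ and $ anchor at every line.

```python
import re

# The line filter from this answer: rejects lines that start with a
# commented-out cite command and lines that contain a * anywhere.
line_rx = re.compile(r'^(?!(%\\(?:no)?cite\w*\{([^}]*?)\}))[^*\n]*$', re.MULTILINE)

# Assumed second pattern that pulls the keys out of an accepted line.
key_rx = re.compile(r'\\(?:no)?cite\w*\{([^{}]+)\}')

latex = r"""
blablabla
Author et. al \cite{author92} bla bla. % should match
\citep{author93} % should match
\nocite{author94} % should match
100\%\nocite{author95} % should match
100\% \nocite{author95} % should match
%\nocite{author96} % should not match
\cite{author97, author98, author99} % should match
\nocite{*} % should not match
"""

keys = []
for line_match in line_rx.finditer(latex):
    # group(0) is the full text of an accepted line
    keys.extend(key_rx.findall(line_match.group(0)))
print(keys)
```

On the sample this gives the same six entries as the first answer. It is weaker in general, though: a cite inside a trailing comment, or after "% " with a space, would still be picked up, and any line containing a * is dropped entirely.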
