How to match a string that does not contain two consecutive newlines

Time: 2015-05-24 13:37:35

Tags: python regex

Demo on regex101. I have the following text file (a BibTeX .bbl file):

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
  De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, G.~De~Franceschi, V.~Romano, M.~Aquino, A.~Dodson, and
  C.~N. Mitchell (2011{\natexlab{a}}), Bipolar climatology of {GPS} ionospheric
  scintillation at solar minimum, \textit{Radio Science}, \textit{46}(3),
  \doi{10.1029/2010RS004571}.

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
  Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
  Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, J.~Tong, G.~De~Franceschi, V.~Romano, A.~Bourdillon,
  M.~Le~Huy, and C.~Mitchell (2011{\natexlab{b}}), {GPS} scintillation and
  {TEC} gradients at equatorial latitudes in april 2006, \textit{Advances in
  Space Research}, \textit{47}(10), 1750--1757,
  \doi{10.1016/j.asr.2010.04.020}.

\bibitem[{\textit{Anghel et~al.}(2008)\textit{Anghel, Astilean, Letia, and
  Komjathy}}]{anghel2008nrm}
Anghel, A., A.~Astilean, T.~Letia, and A.~Komjathy (2008), Near real-time
  monitoring of the ionosphere using dual frequency {GPS} data in a kalman
  filter approach, in \textit{{IEEE} International Conference on Automation,
  Quality and Testing, Robotics, 2008. {AQTR} 2008}, vol.~2, pp. 54--58,
  \doi{10.1109/AQTR.2008.4588793}.

\bibitem[{\textit{Baker and Wing}(1989)}]{baker1989nmc}
Baker, K.~B., and S.~Wing (1989), A new magnetic coordinate system for
  conjugate studies at high latitudes, \textit{Journal of Geophysical Research:
  Space Physics}, \textit{94}(A7), 9139--9143, \doi{10.1029/JA094iA07p09139}.

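Given the reference key at the end of the command, I want to match the whole \bibitem command of a single entry (with some capture groups). I am using this regex, which works for the first entry but not for the others (shown below for the second entry):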

\\bibitem\[{(.*?)\((.*?)\)(.*?)}\]{alfonsi2011gsa}

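This doesn't work, because it matches everything from the start of the first \bibitem command to the end of the second \bibitem command. How can I match only the second \bibitem command? I tried a negative lookahead on ^$\n\n, but I couldn't get it to work. Basically, I want the third (.*?) to match any string that does not contain two consecutive newlines. (If there is any other way to do this, I'm all ears.)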
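The over-match can be reproduced with a short Python sketch. It assumes the pattern is compiled with re.DOTALL (the "s" flag on regex101), since without it "." cannot cross the line breaks inside an entry and the pattern finds nothing at all. SAMPLE is a shortened copy of the first two entries, and the braces are escaped only for readability.

```python
import re

# Shortened copy of the first two entries from the .bbl file above.
SAMPLE = r"""\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
  De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, G.~De~Franceschi, V.~Romano, M.~Aquino, A.~Dodson, and

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
  Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
  Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, J.~Tong, G.~De~Franceschi, V.~Romano, A.~Bourdillon,
"""

# The pattern from the question; DOTALL is an assumption (see above).
pattern = re.compile(
    r"\\bibitem\[\{(.*?)\((.*?)\)(.*?)\}\]\{alfonsi2011gsa\}",
    re.DOTALL,
)

match = pattern.search(SAMPLE)

# A regex match starts at the leftmost position that can succeed, so the
# match begins at the FIRST \bibitem and the third group swallows the
# blank line and the start of the second entry.
starts_at_first_entry = match.start() == 0
crosses_blank_line = "\n\n" in match.group(3)
print(starts_at_first_entry, crosses_blank_line)
```

Lazy quantifiers only make each group as short as possible from a given starting point. They do not move the starting point to the nearest \bibitem.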

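2 answers:

Answer 0: (score: 1)

You can use a negative lookahead, (?!...), to keep the match from spanning more than one occurrence of 'bibitem'. That way the match starts at the 'bibitem' immediately preceding your reference key. This seems to work: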

\\bibitem\[{(((?!bibitem).)*?)\((((?!bibitem).)*?)\)(((?!bibitem).)*?)}\]{alfonsi2011gsa}
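As a rough Python sketch of this idea: the helper below builds the same kind of pattern for any reference key. The name entry_pattern and its stop parameter are hypothetical, the inner groups are made non-capturing so that only three groups come back, and re.DOTALL is assumed so that "." can cross the line breaks inside one entry. Passing stop=r"\n\n" gives the variant asked for in the question, where no group may cross a blank line.

```python
import re

# Shortened copy of the first two entries from the .bbl file above.
SAMPLE = r"""\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
  De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, G.~De~Franceschi, V.~Romano, M.~Aquino, A.~Dodson, and

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
  Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
  Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, J.~Tong, G.~De~Franceschi, V.~Romano, A.~Bourdillon,
"""


def entry_pattern(key, stop="bibitem"):
    """Hypothetical helper: compile a pattern for the entry ending in `key`.

    `stop` is a regex fragment that no capture group is allowed to cross.
    """
    # Consume one character at a time, and only while `stop` does not
    # begin at the current position.
    tempered = r"((?:(?!" + stop + r").)*?)"
    return re.compile(
        r"\\bibitem\[\{" + tempered + r"\(" + tempered + r"\)" + tempered
        + r"\}\]\{" + re.escape(key) + r"\}",
        re.DOTALL,
    )


by_keyword = entry_pattern("alfonsi2011gsa").search(SAMPLE)
by_blank_line = entry_pattern("alfonsi2011gsa", stop=r"\n\n").search(SAMPLE)

print(by_keyword.groups())
print(by_keyword.span() == by_blank_line.span())
```

Both variants fail at the first \bibitem, because no group can get past the stop text, so the search moves on and succeeds at the second entry.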

Answer 1: (score: 0)

Regex is not my strong suit, but this will get everything you want without reading the whole file into memory at once:

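from itertools import groupby
import re

# Braces are escaped so they are always read as literals.
r = re.compile(r"\[\{(.*?)\((.*?)\)(.*?)\}\]\{alfonsi2011gsa\}")
with open("file.txt") as f:
    # groupby splits the stripped lines into runs of blank and non-blank
    # lines, so each non-blank run is one \bibitem entry.
    for has_text, lines in groupby(map(str.strip, f), key=bool):
        # Join with a space so words on adjacent lines do not run together.
        match = r.search(" ".join(lines))
        if match:
            print(match.groups())

Output:

('\\textit{Alfonsi et~al.}', '2011{\\natexlab{b}}', '\\textit{Alfonsi, Spogli, Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and Mitchell}')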
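To see why this works, here is a small sketch of what groupby yields when the key is the truthiness of each stripped line. The list lines is made-up toy data standing in for the file.

```python
from itertools import groupby

# Toy stand-in for the stripped lines of a file: two "entries" separated
# by blank lines (empty strings are falsy, everything else is truthy).
lines = ["a1", "a2", "", "", "b1", ""]

# groupby starts a new group every time the key changes, so consecutive
# non-blank lines end up together and blank lines act as separators.
runs = [(has_text, list(group)) for has_text, group in groupby(lines, key=bool)]
print(runs)
# [(True, ['a1', 'a2']), (False, ['', '']), (True, ['b1']), (False, [''])]
```

Because each run is searched on its own, the regex never sees two entries at once and cannot match across a blank line.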