Question

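I am trying to use regular expressions in Python to extract the titles of certain tables from plain text.

The plain text was exported from some PDF files and contains a lot of \n. I am trying to stop the match before the first occurrence of the pattern \n \n\n, but the regex always returns more characters than I want.

Here is an example.

The string is: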

contents = '\n\n\n\n\n\n\n\nClient: ABC area: Location Mc\nHole: 33-44   \n \n\n \n\nKJK TechCen    Rep # 5243 \n \n\n \n\n95 \n\nTable 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom'

The regex I used is:

re.findall(r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+ [^ \n \n\n ]', contents)

I want the resulting string to start at 'Table XXX' and end before the first ' \n \n\n ', like this:

'Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF '

But the actual string I get is:

'Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V'

So how can I modify the regex to get rid of the annoying '\n \n\n PressRel V'?

Answer 1

You could use a positive lookahead (?= instead of a character class to assert what should follow directly on the right.

Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )

Regex demo

Or you could capture the value in a group and match the following newlines:

(Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+) \n \n\n 

Regex demo using a group
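As a rough sketch of both options in Python, reusing the `contents` string from the question. The names `TITLE`, `STOP`, `lookahead_matches` and `group_matches` are only illustrative and do not come from the original answer.

```python
import re

# The sample string from the question, split across lines for readability.
contents = (
    '\n\n\n\n\n\n\n\nClient: ABC area: Location Mc\nHole: 33-44   \n \n\n \n\n'
    'KJK TechCen    Rep # 5243 \n \n\n \n\n95 \n\n'
    'Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n'
    ' PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom'
)

# Illustrative names: the title part of the pattern and the stop sequence.
TITLE = r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+'
STOP = r' \n \n\n '

# Option 1: assert the stop sequence with a lookahead, without consuming it.
lookahead_matches = re.findall(TITLE + '(?=' + STOP + ')', contents)

# Option 2: capture the title in a group and consume the stop sequence;
# with one group, findall returns only the captured text.
group_matches = re.findall('(' + TITLE + ')' + STOP, contents)

print(lookahead_matches)
print(group_matches)
```

Both calls should yield the title ending at 'oF', without the space that starts the stop sequence. The quantifier is still greedy here, so if the stop sequence occurred again further along inside the run matched by the character class, the match would extend to that later occurrence.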

Answer 2

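You need a non-greedy +? instead of +, because every character that occurs in the end sequence is also in the bracketed character class in the middle, so a greedy + happily matches straight through it.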

import re  # `contents` is the string from the question

end = r' \n \n\n '
result = re.findall(r'Table[^:]*:[a-zA-Z0-9 :&–=\n%@,()°-]+?' + end, contents)
#result = ['Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n ']

# to chop off the end, if needed:
result = [x[:-len(end)] for x in result]

The [^ \n \n\n ] part in your example is equivalent to [^ \n], "a character that is not a newline or a space".
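Both points can be checked with a small sketch. The `toy` string and the variable names below are made up for illustration; the toy string contains the stop sequence twice so that the greedy and non-greedy quantifiers give different results.

```python
import re

# A character class is a set, so repeated characters add nothing.
sample = 'oF \n \n\n PressRel V'
verbose_class = re.findall(r'[^ \n \n\n ]', sample)
short_class = re.findall(r'[^ \n]', sample)
same_class = verbose_class == short_class

# Hypothetical text in which the stop sequence ' \n \n\n ' appears twice.
toy = 'Table 1: A \n \n\n B \n \n\n C'
stop = r'(?= \n \n\n )'

# Greedy +: backtracks from the end, so it stops at the LAST stop sequence.
greedy = re.search(r'Table[a-zA-Z0-9 :\n]+' + stop, toy).group()

# Non-greedy +?: grows one character at a time, so it stops at the FIRST one.
lazy = re.search(r'Table[a-zA-Z0-9 :\n]+?' + stop, toy).group()

print(same_class)
print(repr(greedy))
print(repr(lazy))
```

In the question's sample the stop sequence happens to occur only once within reach of the character class, which is why a greedy lookahead also works there; the non-greedy form is the safer choice when it may repeat.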

How do I stop a regex match at a certain pattern in a string?
