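I am trying to use regular expressions in Python to extract the titles of certain tables from plain text.
The plain text was exported from PDF files and contains a lot of '\n' characters. I am trying to stop the match just before the first occurrence of the pattern ' \n \n\n ', but the regex always returns more characters than I want.
Here is an example.
The string is: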
contents = '\n\n\n\n\n\n\n\nClient: ABC area: Location Mc\nHole: 33-44 \n \n\n \n\nKJK TechCen Rep # 5243 \n \n\n \n\n95 \n\nTable 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom'
The regex I am using is:
re.findall(r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+ [^ \n \n\n ]', contents)
I want the resulting string to start at 'Table XXX' and end just before the first ' \n \n\n ', like this:
'Table 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF '
But the string I actually get is:
'Table 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V'
So how can I modify the regex to get rid of the annoying '\n \n\n PressRel V'?
Answer 0 (score: 1)
You can use a positive lookahead, (?=...),
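instead of a character class, to assert what must follow immediately on the right:
Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )
Or you could capture the value in a group and match the following newlines.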
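Both ideas can be sketched as follows, using the contents string from the question. The capture-group pattern is only an illustration, since the answer does not spell it out; it assumes a lazy quantifier so that the group stops at the first delimiter.

```python
import re

# The string from the question, split across lines for readability.
contents = ('\n\n\n\n\n\n\n\nClient: ABC area: Location Mc\nHole: 33-44 \n \n\n \n\n'
            'KJK TechCen Rep # 5243 \n \n\n \n\n95 \n\n'
            'Table 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n '
            'PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom')

# Positive lookahead: the delimiter must follow the match but is not consumed.
lookahead_pattern = r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )'
with_lookahead = re.findall(lookahead_pattern, contents)

# Capture group (illustrative assumption): the delimiter is consumed,
# but findall returns only the text captured by the parentheses.
group_pattern = r'(Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+?) \n \n\n '
with_group = re.findall(group_pattern, contents)

print(with_lookahead)
print(with_group)
```

On this input both calls return the title ending in 'oF', without the delimiter. Note that the lookahead version keeps the greedy +, so it would run to the last delimiter if the character class could reach more than one.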
Answer 1 (score: 1)
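You need a non-greedy +? instead of +, because every character that occurs in the end sequence is also in the bracketed character class in the middle.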
import re

# Plain (non-raw) string, so len(end) equals the number of characters actually matched
end = ' \n \n\n '
result = re.findall(r'Table[^:]*:[a-zA-Z0-9 :&–=\n%@,()°-]+?' + end, contents)
# result == ['Table 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n ']
# to chop off the end, if needed:
result = [x[:-len(end)] for x in result]
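To see why the lazy quantifier matters, here is a small sketch on a made-up string (not from the question) in which the delimiter occurs twice. The sample text and the reduced character class are assumptions chosen for illustration only.

```python
import re

end = ' \n \n\n '  # plain string: six real characters
sample = 'Table 1: first' + end + 'other text' + end + 'tail'

body = r'Table[^:]*:[a-zA-Z0-9 \n]+'

# Greedy: the class swallows everything it can, then backtracks
# only as far as the LAST delimiter.
greedy = re.findall(body + end, sample)

# Lazy: the class grows one character at a time and stops
# at the FIRST delimiter.
lazy = re.findall(body + '?' + end, sample)

# Strip the trailing delimiter from the lazy matches.
titles = [m[:-len(end)] for m in lazy]

print(greedy)
print(lazy)
print(titles)
```

With a single delimiter in reach, greedy and lazy give the same result; they differ only when the character class can run past more than one delimiter, as here.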
The [^ \n \n\n ] part in your example is equivalent to [^ \n], i.e. "a character that is neither a newline nor a space".
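A quick way to check this equivalence, and to see why the original pattern overshoots, is sketched below. The shortened text is an assumption for illustration; it keeps only the part of the question's string around the delimiter.

```python
import re

text = 'Table 3.1: Test \n\noF \n \n\n PressRel V \n% \n\nLiq/To'

# Repeating characters inside [...] changes nothing: both classes
# match any single character that is not a space or a newline.
long_class = re.findall(r'[^ \n \n\n ]', text)
short_class = re.findall(r'[^ \n]', text)

# The question's pattern therefore ends with "a space, then one
# non-blank character", which the greedy class satisfies as late as it can.
original = re.findall(
    r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+ [^ \n \n\n ]', text)

print(long_class == short_class)
print(original)
```

The trailing class matches exactly one character, so it cannot act as a multi-character stop sequence; that is what the lookahead or the lazy quantifier is for.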