刚刚发现我的文件结构可能不同而且我的正则表达式有时只是因为这种变化而起作用。我的正则表达是
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\rACTIVITY.+?',wholefile)
它目前与文件的以下部分匹配。
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
79086 [PAGOSA 115] 0.937 2.0333 73 WAPA R.M. 791
ACTIVITY?
PDEV
ENTER OUTPUT DEVICE CODE:
0 FOR NO OUTPUT
1 FOR PROGRESS WINDOW
但是文件的该部分有时如下所示
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.742 13.2060 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.916 1.8367 11 EL PASO 110
70187 [FTGARLND69.0] 0.936 19.6099 70 PSCOLORADO 710
73216 [WINDRIVR 115] 0.858 3.6100 73 WAPA R.M. 750
(VFSCAN) AT TIME = 20.0000 UP TO 100 BUSES WITH LOW FREQUENCY BELOW 59.600:
X ----- BUS ------ X FREQ X ----- BUS ------ X FREQ
12063 [ROSEBUD 13.8] 59.506
在这两种情况下,我只想捕捉以下部分:
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
79086 [PAGOSA 115] 0.937 2.0333 73 WAPA R.M. 791
无论我看到哪个版本的文件,我的正则表达式如何返回上面的部分?
答案 0 :(得分:1)
这应该有效
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\r(ACTIVITY|\(VFSCAN\)).+?',wholefile)
答案 1 :(得分:1)
我不建议使用正则表达式,而是做一些解析。假设您的数据位于名为data
的字符串中:
lines = [line for line in data.split("\n")]
# find start of header
for index, line in enumerate(lines):
if "LOW VOLTAGE SUMMARY BY AREA" in line:
start_index = index
break
# first first data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[start_index:]):
if line.strip() and line.split()[0].isdigit():
first_entry_index = start_index + index
break
# find last data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[first_entry_index:]):
# we don't do this inside the if because it's possible
# to end the data with only entries and whitespace
end_entry_index = first_entry_index + index
if line.strip() and not line.split()[0].isdigit():
break
# print all lines between header and last data entry
print("\n".join(lines[start_index:end_entry_index]))