Need to modify a regex so it works in other cases

Time: 2012-10-01 14:02:48

Tags: python regex

I just found out that my file structure can vary, and my regex only works some of the time because of that variation. My regex is:

v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\rACTIVITY.+?',wholefile)

It currently matches the following part of the file:

----------           LOW VOLTAGE SUMMARY BY AREA            ----------

         BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

       12006  [AMISTAD 69.0]   0.971   1.8700  10 NEW MEXICO    121
       11223  [WHITESA213.8]   0.918   1.9900  11 EL PASO       110
       70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
       70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
       79086  [PAGOSA   115]   0.937   2.0333  73 WAPA R.M.     791

ACTIVITY? 
PDEV

ENTER OUTPUT DEVICE CODE:
 0 FOR NO OUTPUT
 1 FOR PROGRESS WINDOW

But that part of the file sometimes looks like this instead:

    ----------           LOW VOLTAGE SUMMARY BY AREA            ----------

         BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

       12006  [AMISTAD 69.0]   0.742  13.2060  10 NEW MEXICO    121
       11223  [WHITESA213.8]   0.916   1.8367  11 EL PASO       110
       70187  [FTGARLND69.0]   0.936  19.6099  70 PSCOLORADO    710
       73216  [WINDRIVR 115]   0.858   3.6100  73 WAPA R.M.     750

(VFSCAN) AT TIME = 20.0000 UP TO  100 BUSES WITH LOW FREQUENCY BELOW 59.600:

X ----- BUS ------ X    FREQ       X ----- BUS ------ X    FREQ
12063 [ROSEBUD 13.8]   59.506     

In both cases, I only want to capture the following part:

----------           LOW VOLTAGE SUMMARY BY AREA            ----------

     BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE

   12006  [AMISTAD 69.0]   0.971   1.8700  10 NEW MEXICO    121
   11223  [WHITESA213.8]   0.918   1.9900  11 EL PASO       110
   70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
   70044  [B.HYDROB4.16]   0.955   2.3233  70 PSCOLORADO    703
   79086  [PAGOSA   115]   0.937   2.0333  73 WAPA R.M.     791

How can I make my regex return the part above no matter which version of the file I get?

2 answers:

Answer 0: (score: 1)

This should work:

v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\r(?:ACTIVITY|\(VFSCAN\)).+?', wholefile)
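Note that `re.findall` returns the contents of the groups, not the whole match, when the pattern contains a capturing group, and that a pattern ending in the stop marker also pulls that marker into the result. A minimal sketch of one way around both, assuming the block always ends at whitespace followed by either `ACTIVITY` or `(VFSCAN)`: put a non-capturing alternation inside a lookahead so the match stops right after the last data row. The names `SUMMARY_RE` and `extract_summaries` and the shortened sample strings are made up for illustration.

```python
import re

# Assumption: the only stop markers are the two seen in the question.
SUMMARY_RE = re.compile(
    r"-{10}\s*LOW VOLTAGE SUMMARY BY AREA"  # start of the header line
    r".*?"                                  # rest of header, column titles, data rows
    r"(?=\s+(?:ACTIVITY|\(VFSCAN\)))",      # stop before whitespace + a stop marker
    re.DOTALL,
)


def extract_summaries(text):
    """Return every summary block found in text (hypothetical helper)."""
    return SUMMARY_RE.findall(text)


HEADER = "----------           LOW VOLTAGE SUMMARY BY AREA            ----------\n\n"
COLUMNS = "         BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE\n\n"

# Shortened copies of the two layouts shown in the question.
variant_a = (
    HEADER + COLUMNS
    + "       12006  [AMISTAD 69.0]   0.971   1.8700  10 NEW MEXICO    121\n"
    + "       79086  [PAGOSA   115]   0.937   2.0333  73 WAPA R.M.     791\n"
    + "\nACTIVITY? \nPDEV\n"
)
variant_b = (
    "    " + HEADER + COLUMNS
    + "       12006  [AMISTAD 69.0]   0.742  13.2060  10 NEW MEXICO    121\n"
    + "       73216  [WINDRIVR 115]   0.858   3.6100  73 WAPA R.M.     750\n"
    + "\n(VFSCAN) AT TIME = 20.0000 UP TO  100 BUSES WITH LOW FREQUENCY BELOW 59.600:\n"
)

for sample in (variant_a, variant_b):
    for block in extract_summaries(sample):
        print(block)
        print("=" * 20)
```

Because the lookahead uses `\s+` rather than a literal `\r`, the sketch does not depend on which line-ending convention the file uses. If other sections can follow the summary, their first words would need to be added to the alternation.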

Answer 1: (score: 1)

I wouldn't recommend a regex here; do some parsing instead. Assuming your data is in a string named data:

lines = data.splitlines()

# find start of header
for index, line in enumerate(lines):
    if "LOW VOLTAGE SUMMARY BY AREA" in line:
        start_index = index
        break
else:
    raise ValueError("summary header not found")

# find first data entry (line whose first token is a number)
for index, line in enumerate(lines[start_index:], start_index):
    if line.strip() and line.split()[0].isdigit():
        first_entry_index = index
        break
else:
    raise ValueError("no data entries found after the header")

# find last data entry; blank lines are skipped because the data
# may end with only entries and whitespace
last_entry_index = first_entry_index
for index, line in enumerate(lines[first_entry_index:], first_entry_index):
    if not line.strip():
        continue
    if not line.split()[0].isdigit():
        break  # first non-blank, non-data line ends the block
    last_entry_index = index

# print all lines from the header through the last data entry
print("\n".join(lines[start_index:last_entry_index + 1]))
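As a usage sketch, the same line-scanning idea can be wrapped in a function and tried against both file layouts from the question. The function name `extract_low_voltage_block`, its `title` parameter, and the shortened sample strings are illustrative assumptions, not part of the original answer.

```python
def extract_low_voltage_block(text, title="LOW VOLTAGE SUMMARY BY AREA"):
    """Return the header through the last data row, or None if no header.

    Hypothetical helper: a data row is assumed to be any line whose first
    whitespace-separated token is all digits (the bus number).
    """
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if title in line), None)
    if start is None:
        return None

    last = start        # index of the last line to keep
    seen_data = False
    for i in range(start + 1, len(lines)):
        tokens = lines[i].split()
        if not tokens:
            continue                    # blank line: keep scanning
        if tokens[0].isdigit():
            last, seen_data = i, True   # data row: extend the block
        elif seen_data:
            break                       # first non-data line after the rows
    return "\n".join(lines[start:last + 1])


header = "----------           LOW VOLTAGE SUMMARY BY AREA            ----------\n\n"
columns = "         BUS   NAME   BASKV    VOLT    TIME       AREA     ZONE\n\n"

# Shortened copies of the two layouts shown in the question.
ends_with_activity = (
    header + columns
    + "       12006  [AMISTAD 69.0]   0.971   1.8700  10 NEW MEXICO    121\n"
    + "       79086  [PAGOSA   115]   0.937   2.0333  73 WAPA R.M.     791\n"
    + "\nACTIVITY? \nPDEV\n"
)
ends_with_vfscan = (
    "    " + header + columns
    + "       12006  [AMISTAD 69.0]   0.742  13.2060  10 NEW MEXICO    121\n"
    + "       73216  [WINDRIVR 115]   0.858   3.6100  73 WAPA R.M.     750\n"
    + "\n(VFSCAN) AT TIME = 20.0000 UP TO  100 BUSES WITH LOW FREQUENCY BELOW 59.600:\n"
    + "\n12063 [ROSEBUD 13.8]   59.506\n"
)

for sample in (ends_with_activity, ends_with_vfscan):
    print(extract_low_voltage_block(sample))
    print("=" * 20)
```

Returning None instead of raising leaves it to the caller to decide what a missing summary means, which may or may not suit your script.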