Matching a section of a large text file using Python

Time: 2012-09-03 04:29:49

Tags: python regex

I have been manually searching a large file generated by running a program. I have managed to extract some blocks of information, but I am still trying to extract the last three blocks. The blocks are structured as follows:

(VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:

X ----- BUS ------ X    VDEV       X ----- BUS ------ X    VDEV
18436 [LENZIE 618.0]   -0.245      18433 [LENZIE 318.0]   -0.245
18431 [LENZIE 118.0]   -0.214      18432 [LENZIE 218.0]   -0.214
18435 [LENZIE 518.0]   -0.214      18434 [LENZIE 418.0]   -0.214

(VFSCAN) AT TIME =  2.6267 UP TO  100 BUSES WITH LOW VOLTAGE BELOW  0.700:

X ----- BUS ------ X    VOLT       X ----- BUS ------ X    VOLT
65191 [BONANZA 24.0]    0.439      65194 [CHAPITA  138]    0.581
65192 [BONANZA  138]    0.585      65371 [COVE TP  138]    0.694
66278 [RANGELY  138]    0.698

(VFSCAN) AT TIME =  6.0632 UP TO  100 BUSES WITH LOW FREQUENCY BELOW 59.600:

X ----- BUS ------ X    FREQ       X ----- BUS ------ X    FREQ
27117 [WTGCP   .600]   59.443      27123 [WTGGE2  .570]   59.490
27119 [WTGGE   .570]   59.492      26040 [INTERM2G26.0]   59.492
26039 [INTERM1G26.0]   59.492

I have tried many regular expressions without success, for example:

v2 = re.findall(r'(?s)\(VFSCAN\) AT TIME =(.*?)100 BUSES WITH LOW VOLTAGE DEVIATION BELOW.*?\s*$',wholefile)

wholefile is the entire file, which I have read in. The file contains several of each of the sections above, and I want to extract all of them so that I can find the last entry, e.g. (18436 [LENZIE 618.0] -0.245). I will then parse the line with the time to determine when it occurred. I have to do the same for "voltage deviation", "voltage" and "frequency". If I find out how to match one variable-length, multi-line section, the other sections should work the same way. My problem is knowing where to end the search. I rely on the fact that the search should end at the last blank line (hence \s*$). I use findall to extract all of these sections, for voltage deviation for example.

I also have a problem with the VERBOSE definition of a pattern in Python. It does not seem to work (below). Am I doing something wrong?

pattern = r"""
(?s)                                                            # Tell Regex to span multiple lines
\(VFSCAN\).*100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:  # Literal string to search for
(\s*$).*?                                                        # This search for an empty line
X ----- BUS ------ X    VOLT       X ----- BUS ------ X    VOLT   # Literal string to search
(\d{5}.*).*?                                                         # Multiple lines starting with numbers
\s*$                                                                 # This search ends with an empty line
"""
regex = re.compile(pattern, re.VERBOSE)

The next day: after trying for several hours, I came up with the following. The first matches everything (not what I need), and the second, which I was sure would work, does not match my test file.

First:

(?s)^\(VFSCAN\).*100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:.*(\s*$)?

Second:

(?m)(?s)^\(VFSCAN\).*100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:^\s*$^X ----- BUS ------ X    VDEV.*?
(.*?)
^\s*$

With these regular expressions I am trying to match exactly the following part of the file.

(VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:

X ----- BUS ------ X    VDEV       X ----- BUS ------ X    VDEV
18436 [LENZIE 618.0]   -0.245      18433 [LENZIE 318.0]   -0.245
18431 [LENZIE 118.0]   -0.214      18432 [LENZIE 218.0]   -0.214
18435 [LENZIE 518.0]   -0.214      18434 [LENZIE 418.0]   -0.214

I need some help fixing the pattern so that I can select the above.

I have a problem with the following text. I only want to extract the time and the related items in all the square brackets "[]".

test3 = r'''(VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -    0.200:

X ----- BUS ------ X    VDEV       X ----- BUS ------ X    VDEV
18436 [LENZIE 618.0]   -0.245      18433 [LENZIE 318.0]   -0.245
18431 [LENZIE 118.0]   -0.214      18435 [LENZIE 518.0]   -0.214
18434 [LENZIE 418.0]   -0.214      18432 [LENZIE 218.0]   -0.214

(VFSCAN) AT TIME =  1.5167 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:

X ----- BUS ------ X    VDEV       X ----- BUS ------ X    VDEV
69036 [DNLP2G21.575]   -0.414      69038 [DNLP2G22.575]   -0.414
69040 [DNLP2G23.575]   -0.414      69032 [DNLP1_G1.575]   -0.402
65460 [DIFICULT 230]   -0.384      69027 [7MIHL G1.575]   -0.355
69076 [HORIZ_G .575]   -0.303      67237 [MEDBOWCO 115]   -0.301
67940 [STNDPSVC 230]   -0.300      65976 [MINERS  34.5]   -0.294
65585 [FT CRK1 34.5]   -0.261      65584 [FT CRK2 34.5]   -0.261
69073 [HIPLN_G .575]   -0.214

(VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:

X ----- BUS ------ X    VDEV       X ----- BUS ------ X    VDEV
65191 [BONANZA 24.0]   -0.572      65192 [BONANZA  138]   -0.434
65194 [CHAPITA  138]   -0.433      66278 [RANGELY  138]   -0.320
65371 [COVE TP  138]   -0.302      79265 [CALAMRDG 138]   -0.286
79400 [DES.MINE 138]   -0.285      65086 [ASHLEY  69.0]   -0.284
79067 [VERNAL   138]   -0.277      67257 [MOONLAK269.0]   -0.268
67256 [MOONLAK169.0]   -0.266      79264 [W.RV.CTY 138]   -0.206

'''

That is not what I get when I use findall with my pattern. I should get more than 30 matching tuples in the list.

2 Answers:

Answer 0: (score: 2)

Regex to extract the fields

\(VFSCAN\)[^=]*=\s*    # first line of a section: (VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200
(\d*(?:\.\d+)?)        # group 1 - first number of first line: 1.1800
\D+
(\d+)                  # group 2 - second number of first line: 100
[^\d-]+
(-?\d*(?:\.\d+)?)      # group 3 - last number of first line: -0.200
\D+                    # skip second line
(?:                    # a data line: 18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245
  (?:                  # a data entry: 18436 [LENZIE 618.0] -0.245
    (\d+)              # group 4 - first number in an entry: 18436
    \s+\[
    (.*?)              # group 5 - words in brackets: LENZIE
    (-?\d*(?:\.\d+)?)  # group 6 - number in brackets: 618.0
    \]\s*
    (\S*)              # group 7 - last number (VDEV): -0.245
    \s*
  )+
  (?=[\r\n\s]+|$)
)+
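One caveat when running a pattern like this from Python: the `re` module keeps only the last repetition of a capture group that sits inside a repeated group, so the per-entry groups would hold just the final entry of each section. A minimal sketch of a two-pass alternative follows; it is not the pattern above but a swapped-in approach, and the names `SECTION`, `ENTRY` and `extract` are hypothetical. It assumes every section starts with `(VFSCAN) AT TIME =` and runs until the next such header or the end of the text.

```python
import re

sample = """\
(VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:

X ----- BUS ------ X    VDEV       X ----- BUS ------ X    VDEV
18436 [LENZIE 618.0]   -0.245      18433 [LENZIE 318.0]   -0.245
18431 [LENZIE 118.0]   -0.214      18432 [LENZIE 218.0]   -0.214

(VFSCAN) AT TIME =  1.5167 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:

X ----- BUS ------ X    VDEV       X ----- BUS ------ X    VDEV
69036 [DNLP2G21.575]   -0.414      69038 [DNLP2G22.575]   -0.414
69073 [HIPLN_G .575]   -0.214
"""

# Pass 1: one match per section. Group 1 is the time, group 2 is everything
# up to the next "(VFSCAN)" header or the end of the text.
SECTION = re.compile(
    r"\(VFSCAN\) AT TIME =\s*([\d.]+)(.*?)(?=\(VFSCAN\)|\Z)", re.DOTALL
)

# Pass 2: one match per data entry: bus number, bracket contents, value.
ENTRY = re.compile(r"(\d+)\s+\[([^\]]*)\]\s+(-?\d+\.\d+)")


def extract(text):
    """Return (time, bus, name, value) tuples for every entry in every section."""
    rows = []
    for section in SECTION.finditer(text):
        time = float(section.group(1))
        for bus, name, value in ENTRY.findall(section.group(2)):
            rows.append((time, int(bus), name, float(value)))
    return rows


rows = extract(sample)
for row in rows:
    print(row)
```

Each entry becomes its own tuple carrying the time of the section it came from, which is the shape the question asks for.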

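The text BUSES WITH LOW VOLTAGE DEVIATION BELOW lies between groups 2 and 3, where it is matched by [^\d-]+. So you can do one of the following:

Option 1

You can capture this part as well, so that you can check later whether it is the section you want. Just add parentheses around it, making it the 3rd capture group: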

[^\d-]+ => ([^\d-]+)

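Option 2

Or you can change the same part of the regex so that it matches only the section you want. In that case the regex matches only the specified sections rather than every section: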

[^\d-]+ => \s+BUSES\s+WITH\s+LOW\s+VOLTAGE\s+DEVIATION\s+BELOW\s+

If you want to match both of these lines:

BUSES WITH LOW VOLTAGE DEVIATION BELOW
BUSES WITH LOW FREQUENCY BELOW

then you can write the changed part using alternation (|) syntax ((?:...) means that the group is not captured):

[^\d-]+ => \s+BUSES\s+WITH\s+LOW\s+(?:VOLTAGE\s+DEVIATION|FREQUENCY)\s+BELOW\s+
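As a rough illustration of the alternation idea, here is a sketch that also covers the plain LOW VOLTAGE header from the question and records the time of the last section of each kind. It deliberately uses a capturing group for the kind (a variation on the non-capturing form above), and `HEADER` and `last_time_per_kind` are hypothetical names. Note that the longer alternative `VOLTAGE DEVIATION` is listed before `VOLTAGE`.

```python
import re

# Group 1: time, group 2: which kind of section, group 3: threshold.
HEADER = re.compile(
    r"\(VFSCAN\) AT TIME =\s*([\d.]+) UP TO\s+\d+ BUSES WITH LOW "
    r"(VOLTAGE DEVIATION|VOLTAGE|FREQUENCY) BELOW\s*(-?[\d.]+):"
)


def last_time_per_kind(text):
    """Map each section kind to the time of its last occurrence in the text."""
    last = {}
    for time, kind, _threshold in HEADER.findall(text):
        last[kind] = float(time)  # later matches overwrite earlier ones
    return last


headers = """\
(VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
(VFSCAN) AT TIME =  2.6267 UP TO  100 BUSES WITH LOW VOLTAGE BELOW  0.700:
(VFSCAN) AT TIME =  1.5167 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
(VFSCAN) AT TIME =  6.0632 UP TO  100 BUSES WITH LOW FREQUENCY BELOW 59.600:
"""

result = last_time_per_kind(headers)
print(result)
```

"Last" here means last in file order, which matches the question's goal of finding the final entry of each kind.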

Performance improvements

Capture groups

Unnecessary capture groups can be removed, e.g. (xyz) => xyz, or made non-capturing like this: (xyz) => (?:xyz)
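Besides any speed difference, the choice of groups also changes what `re.findall` returns in Python, which may matter more in practice. A small sketch on one data line from the question (the simplified entry patterns here are illustrative, not the full regex above):

```python
import re

line = "18436 [LENZIE 618.0]   -0.245      18433 [LENZIE 318.0]   -0.245"

# Three capture groups: findall returns a list of 3-tuples.
capturing = re.findall(r"(\d+)\s+\[(.+?)\]\s+(\S+)", line)

# One capture group: findall returns a list of plain strings (that group only).
only_names = re.findall(r"\d+\s+\[(.+?)\]\s+\S+", line)

# No capture groups: findall returns the whole matched text of each entry.
no_groups = re.findall(r"\d+\s+\[.+?\]\s+\S+", line)

print(capturing)
print(only_names)
print(no_groups)
```

So dropping a group is only safe when the value it captured is not needed afterwards.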

Unnecessary optionals

Changing .* to .+ may improve performance.

Improved regex

The following regex is an improved version of the one above:

\(VFSCAN\)[^=]*=\s*    # first line of a section: (VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200
(\d*(?:\.\d+)?)        # group 1 - first number of first line: 1.1800
\D+
\d+                    # second number of first line: 100
[^\d-]+
-?\d*(?:\.\d+)?        # last number of first line: -0.200
\D+                    # skip second line
(?:                    # a data line: 18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245
  (?:                  # a data entry: 18436 [LENZIE 618.0] -0.245
    \d+                # first number in an entry: 18436
    \s+\[
    (.+?)              # group 2 - words in brackets: LENZIE
    -?\d*(?:\.\d+)?    # number in brackets: 618.0
    \]\s+
    \S+                # last number (VDEV): -0.245
    \s*
  )+
  (?=[\r\n\s]+|$)
)+
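Commented, multi-line patterns like the ones above have to be compiled with `re.VERBOSE` in Python. In that mode unescaped whitespace in the pattern is ignored, which is a likely reason why a verbose pattern containing literal text with plain spaces fails to match. A minimal sketch, assuming a shortened header pattern (`PATTERN` and `broken` are hypothetical names):

```python
import re

# In verbose mode a space must be escaped (or written as \s, or placed
# inside a character class) to be matched literally.
PATTERN = re.compile(
    r"""
    \(VFSCAN\)\ AT\ TIME\ =\s*   # escaped spaces are matched literally
    ([\d.]+)                     # group 1: the time
    [^\n]*                       # rest of the header line
    BELOW\s*(-?[\d.]+):          # group 2: the threshold
    """,
    re.VERBOSE,
)

header = "(VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:"
match = PATTERN.search(header)
print(match.groups())

# Unescaped spaces are dropped: this pattern behaves like "ATTIME".
broken = re.compile(r"AT TIME", re.VERBOSE)
print(broken.search("AT TIME"))
```

Passing flags such as `re.VERBOSE` and `re.DOTALL` to `re.compile` also avoids having to place inline flags like `(?s)` at the very start of the pattern.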

Answer 1: (score: 0)

You are trying to match VOLT against VDEV in:
(VFSCAN) AT TIME =  1.1800 UP TO  100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:

X ----- BUS ------ X    VDEV       X ----- BUS ------ X    VDEV

Or you are trying to match -0.200 against 0.700 in:
(VFSCAN) AT TIME =  2.6267 UP TO  100 BUSES WITH LOW VOLTAGE BELOW  0.700:

X ----- BUS ------ X    VOLT       X ----- BUS ------ X    VOLT