Method for parsing an undelimited text file of well licence data into a list?

Time: 2019-07-05 18:01:15

Tags: python regex parsing text string-parsing

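I am currently trying to parse, in Python, a text file from the AER that shows the well licences issued daily in Alberta. Basically, I want to separate each licence's data according to the types shown in the file header (well name, unique identifier, licence number, etc.), then add each licence to a list, which can then be moved into a database.

The problem is that the format of the text file in question (see an excerpt below) is not particularly friendly to parsing. There are no delimiters, and it is laid out to be read by humans. My experience with string manipulation is limited, and I don't know how to approach the problem.

Here is an excerpt of the text file in question: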





    DATE: 02 July 2019                                                                                  


    --------------------------------------------------------------------------------------------        
    WELL NAME               LICENCE NUMBER         MINERAL RIGHTS       GROUND ELEVATION                
    UNIQUE IDENTIFIER       SURFACE CO-ORDINATES   BOARD FIELD CENTRE   PROJECTED DEPTH                 
    LAHEE CLASSIFICATION    FIELD                                       TERMINATING ZONE                
    DRILLING OPERATION      WELL PURPOSE           WELL  TYPE           SUBSTANCE                       
    LICENSEE                                                            SURFACE LOCATION                
    --------------------------------------------------------------------------------------------        

    MEG K7N HARDY 4-7-77-5               0483923   ALBERTA CROWN        571.7M                          
    106/04-07-077-05W4/02  S  572.4M  W  278.3M    BONNYVILLE           1600.0M                         
    DEV (NC)                             HARDY                          MCMURRAY FM                     
    HORIZONTAL                           RESUMPTIONPRODUCTION (SCHEME)  CRUDE BITUMEN                   
    MEG ENERGY CORP.                                                    09-07-077-05W4                  

    SPL 11-24 HZ MARTEN 14-25-76-6       0494994   ALBERTA CROWN        705.3M                          
    100/14-25-076-06W5/00  S  566.0M  E  800.6M    ST. ALBERT           2700.0M                         
    OUT (C)                              MARTEN                         CLEARWATER FM                   
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SPUR PETROLEUM LTD.                                                 11-24-076-06W5                  

    SPL 10-24 HZ MARTEN 5-23-76-6        0494995   ALBERTA CROWN        705.5M                          
    100/05-23-076-06W5/00  S  566.3M  W  800.1M    ST. ALBERT           2700.0M                         
    OUT (C)                              MARTEN                         CLEARWATER FM                   
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SPUR PETROLEUM LTD.                                                 10-24-076-06W5                  

    SURGE ENERGY HZ103 VALHALLA 6-7-75-8 0494996   ALBERTA CROWN        770.8M                          
    103/06-07-075-08W6/00  S  372.0M  E  324.5M    GRANDE PRAIRIE       3350.0M                         
    DEV (NC)                             VALHALLA                       DOIG FM                         
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SURGE ENERGY INC.                                                   13-06-075-08W6                  

    CNRL ET AL HZ KARR 4-16-66-3         0494997   ALBERTA CROWN        770.7M                          
    100/04-16-066-03W6/00  N  623.4M  E  127.5M    GRANDE PRAIRIE       5295.0M                         
    DEV (NC)                             KARR                           DUNVEGAN FM                     
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    CANADIAN NATURAL RESOURCES LIMITED                                  05-14-066-03W6     

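I don't need the header information between the dashed lines, or the date. I need to extract the text from each section of each line of each block, as laid out in the header. I have tried a few approaches, including basic string manipulation in Python and RegEx, but none of them solved the problem and I am at a loss. If you need more details about this task, please let me know; I realize this is a big question and somewhat convoluted.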

1 answer:

Answer 0 (score: 1)

This expression, or some derivative of it, might extract the desired data:

[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}

However, it would probably be better to remove the header before passing the text through the regular expression.
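
One way to remove the header could look like the sketch below. It assumes the header appears once and is bounded by two lines made up only of dashes, as in the excerpt; `strip_header` and `RULE` are names made up for this illustration, and a multi-page file that repeats the header would need more work.

```python
import re

# A "rule" line: nothing but dashes, optionally surrounded by spaces or tabs.
RULE = re.compile(r"[ \t]*-{10,}[ \t]*")


def strip_header(report: str) -> str:
    """Return the text that follows the second dashed rule line.

    ASSUMPTION: the date line and column header sit above the second rule,
    and the permit records come after it.
    """
    lines = report.splitlines()
    rules = [i for i, line in enumerate(lines) if RULE.fullmatch(line)]
    if len(rules) < 2:
        return report  # no recognisable header; leave the text untouched
    return "\n".join(lines[rules[1] + 1:])
```

The returned string can then be passed to `re.finditer` in place of the full report.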


If you are interested, the expression is explained further in the right-hand panel of this demo.

Test
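
For reference, here is the same expression written with `re.VERBOSE` so each piece can carry a comment. The comments are one reading of the pattern against the sample data, not an official description of the file format, and `PERMIT_BLOCK` is just an illustrative name.

```python
import re

PERMIT_BLOCK = re.compile(
    r"""
    [A-Z]{1,}            # one or more capital letters: start of the well name
    .*?                  # lazily, more of the same line (dot does not cross newlines)
    \d+-\d+-\d+-\d+      # four dash-separated numbers, e.g. 4-7-77-5 in the well name
    [\s\S]*?             # lazily, anything at all, newlines included
    \s{3,}               # a run of at least three whitespace characters
    \d+-\d+-\d+-         # first three parts of the surface location, e.g. 09-07-077-
    [A-Za-z0-9]{4}       # last part of the surface location, e.g. 05W4
    """,
    re.VERBOSE,
)
```

Because the pattern contains no literal spaces and no `#`, verbose mode should not change what it matches.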

import re

regex = r"[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}"

test_str = (" DATE: 02 July 2019                                                                                  \n\n\n"
    "    --------------------------------------------------------------------------------------------        \n"
    "    WELL NAME               LICENCE NUMBER         MINERAL RIGHTS       GROUND ELEVATION                \n"
    "    UNIQUE IDENTIFIER       SURFACE CO-ORDINATES   BOARD FIELD CENTRE   PROJECTED DEPTH                 \n"
    "    LAHEE CLASSIFICATION    FIELD                                       TERMINATING ZONE                \n"
    "    DRILLING OPERATION      WELL PURPOSE           WELL  TYPE           SUBSTANCE                       \n"
    "    LICENSEE                                                            SURFACE LOCATION                \n"
    "    --------------------------------------------------------------------------------------------        \n\n"
    "    MEG K7N HARDY 4-7-77-5               0483923   ALBERTA CROWN        571.7M                          \n"
    "    106/04-07-077-05W4/02  S  572.4M  W  278.3M    BONNYVILLE           1600.0M                         \n"
    "    DEV (NC)                             HARDY                          MCMURRAY FM                     \n"
    "    HORIZONTAL                           RESUMPTIONPRODUCTION (SCHEME)  CRUDE BITUMEN                   \n"
    "    MEG ENERGY CORP.                                                    09-07-077-05W4                  \n\n"
    "    SPL 11-24 HZ MARTEN 14-25-76-6       0494994   ALBERTA CROWN        705.3M                          \n"
    "    100/14-25-076-06W5/00  S  566.0M  E  800.6M    ST. ALBERT           2700.0M                         \n"
    "    OUT (C)                              MARTEN                         CLEARWATER FM                   \n"
    "    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       \n"
    "    SPUR PETROLEUM LTD.                                                 11-24-076-06W5                  ")

# re.MULTILINE only affects ^ and $, which this pattern does not use; it is harmless here.
matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # The pattern has no capture groups, so this loop only runs if groups are added to it.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

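Suggestion

The fourth bird suggests:

[The expression above] is not anchored and causes a lot of backtracking. Anchoring it with ^[ \t]* might make it a bit more efficient.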

^[ \t]*[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}

See a demo
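
Note that in Python the anchored version only behaves as intended with `re.MULTILINE`, since otherwise `^` matches only at the very start of the string. A minimal sketch, where `find_permit_blocks` is a hypothetical helper name and the input is assumed to be the report text (with or without the header):

```python
import re

ANCHORED = r"^[ \t]*[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}"


def find_permit_blocks(text: str) -> list:
    """Return each matched permit block with surrounding whitespace removed."""
    # re.MULTILINE lets ^ match at the start of every line, not just the string.
    return [m.group().strip() for m in re.finditer(ANCHORED, text, re.MULTILINE)]
```

Each returned string runs from the well name to the surface location of one permit.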

Based on the current sample data, this could also be an option:

^[ \t]*[A-Z]+(?: [A-Z0-9-]+)+[ \t]+[0-9]{7}[ \t]+.*(?:\r?\n(?![ \t]*$).*)*

See a demo
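
Once each permit block is isolated, the individual fields still have to be separated. Since the report looks fixed-width, one option is to slice every line at known column offsets. The sketch below is not part of the original answer: the offsets in `LAYOUT` were measured by hand from the excerpt above (counted after the leading indent) and are an assumption to verify against the real file, and all function and field names are made up for illustration.

```python
import re

# Second pattern from the suggestion. re.MULTILINE is needed so that ^ and $
# match at every line boundary and a blank line ends the block.
BLOCK = re.compile(
    r"^[ \t]*[A-Z]+(?: [A-Z0-9-]+)+[ \t]+[0-9]{7}[ \t]+.*(?:\r?\n(?![ \t]*$).*)*",
    re.MULTILINE,
)

# ASSUMPTION: (field name, start, end) per line of a block, measured from the sample.
LAYOUT = [
    [("well_name", 0, 37), ("licence_number", 37, 47),
     ("mineral_rights", 47, 68), ("ground_elevation", 68, None)],
    [("unique_identifier", 0, 23), ("surface_coordinates", 23, 47),
     ("board_field_centre", 47, 68), ("projected_depth", 68, None)],
    [("lahee_classification", 0, 37), ("field", 37, 68),
     ("terminating_zone", 68, None)],
    [("drilling_operation", 0, 37), ("well_purpose", 37, 47),
     ("well_type", 47, 68), ("substance", 68, None)],
    [("licensee", 0, 68), ("surface_location", 68, None)],
]


def parse_block(block: str) -> dict:
    """Slice the five lines of one permit block into named fields."""
    lines = block.splitlines()
    # Assume every line of the block shares the first line's indent.
    indent = len(lines[0]) - len(lines[0].lstrip())
    record = {}
    for line, columns in zip(lines, LAYOUT):
        body = line[indent:]
        for name, start, end in columns:
            record[name] = body[start:end].strip()
    return record


def parse_report(text: str) -> list:
    """Return one dict per permit block found in the report text."""
    return [parse_block(m.group()) for m in BLOCK.finditer(text)]


# Demo record rebuilt with ljust so the column positions are explicit.
sample = "\n".join("    " + row for row in [
    "MEG K7N HARDY 4-7-77-5".ljust(37) + "0483923".ljust(10)
    + "ALBERTA CROWN".ljust(21) + "571.7M",
    "106/04-07-077-05W4/02".ljust(23) + "S  572.4M  W  278.3M".ljust(24)
    + "BONNYVILLE".ljust(21) + "1600.0M",
    "DEV (NC)".ljust(37) + "HARDY".ljust(31) + "MCMURRAY FM",
    "HORIZONTAL".ljust(37) + "RESUMPTION".ljust(10)
    + "PRODUCTION (SCHEME)".ljust(21) + "CRUDE BITUMEN",
    "MEG ENERGY CORP.".ljust(68) + "09-07-077-05W4",
])
records = parse_report(sample)
print(records[0])
```

Slicing by position, rather than splitting on runs of spaces, is what keeps values such as `RESUMPTIONPRODUCTION (SCHEME)` (two adjacent columns with no gap) and `S  572.4M  W  278.3M` (one column with internal gaps) intact. The resulting list of dicts can be handed to whatever database layer is in use.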