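I am currently trying to parse, in Python, a text file from the AER that lists the drilling licences issued each day in Alberta. Essentially, I want to separate each licence's data by the field types shown in the file header (well name, unique identifier, licence number, and so on), then append each licence to a list that can later be moved into a database.
The problem is that the format of the text file in question (see an excerpt below) is not particularly friendly to parsing. It has no delimiters and is laid out to be human-readable. My experience with string manipulation is limited, and I don't know how to approach the problem.
Here is an excerpt of the text file in question: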
DATE: 02 July 2019
--------------------------------------------------------------------------------------------
WELL NAME LICENCE NUMBER MINERAL RIGHTS GROUND ELEVATION
UNIQUE IDENTIFIER SURFACE CO-ORDINATES BOARD FIELD CENTRE PROJECTED DEPTH
LAHEE CLASSIFICATION FIELD TERMINATING ZONE
DRILLING OPERATION WELL PURPOSE WELL TYPE SUBSTANCE
LICENSEE SURFACE LOCATION
--------------------------------------------------------------------------------------------
MEG K7N HARDY 4-7-77-5 0483923 ALBERTA CROWN 571.7M
106/04-07-077-05W4/02 S 572.4M W 278.3M BONNYVILLE 1600.0M
DEV (NC) HARDY MCMURRAY FM
HORIZONTAL RESUMPTIONPRODUCTION (SCHEME) CRUDE BITUMEN
MEG ENERGY CORP. 09-07-077-05W4
SPL 11-24 HZ MARTEN 14-25-76-6 0494994 ALBERTA CROWN 705.3M
100/14-25-076-06W5/00 S 566.0M E 800.6M ST. ALBERT 2700.0M
OUT (C) MARTEN CLEARWATER FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
SPUR PETROLEUM LTD. 11-24-076-06W5
SPL 10-24 HZ MARTEN 5-23-76-6 0494995 ALBERTA CROWN 705.5M
100/05-23-076-06W5/00 S 566.3M W 800.1M ST. ALBERT 2700.0M
OUT (C) MARTEN CLEARWATER FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
SPUR PETROLEUM LTD. 10-24-076-06W5
SURGE ENERGY HZ103 VALHALLA 6-7-75-8 0494996 ALBERTA CROWN 770.8M
103/06-07-075-08W6/00 S 372.0M E 324.5M GRANDE PRAIRIE 3350.0M
DEV (NC) VALHALLA DOIG FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
SURGE ENERGY INC. 13-06-075-08W6
CNRL ET AL HZ KARR 4-16-66-3 0494997 ALBERTA CROWN 770.7M
100/04-16-066-03W6/00 N 623.4M E 127.5M GRANDE PRAIRIE 5295.0M
DEV (NC) KARR DUNVEGAN FM
HORIZONTAL NEW PRODUCTION CRUDE OIL
CANADIAN NATURAL RESOURCES LIMITED 05-14-066-03W6
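I do not need the date or the header information between the dashed lines. I need to extract the text of each field on each line of each block, as indicated by the header. I have tried a few approaches, including basic string manipulation in Python and RegEx, but none of them solved the problem and I am at a loss. If you need more detail to clarify the task, please let me know; I realize this is a big question and somewhat confusing.
Answer 0 (score: 1)
This expression, or some derivative of it, might extract the desired data: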
[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}
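However, it would probably be better to remove the header before passing the text through the regular expression.
If you are interested, the expression is explained further in the right-hand panel of this demo.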
import re
regex = r"[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}"
test_str = (" DATE: 02 July 2019 \n\n\n"
" -------------------------------------------------------------------------------------------- \n"
" WELL NAME LICENCE NUMBER MINERAL RIGHTS GROUND ELEVATION \n"
" UNIQUE IDENTIFIER SURFACE CO-ORDINATES BOARD FIELD CENTRE PROJECTED DEPTH \n"
" LAHEE CLASSIFICATION FIELD TERMINATING ZONE \n"
" DRILLING OPERATION WELL PURPOSE WELL TYPE SUBSTANCE \n"
" LICENSEE SURFACE LOCATION \n"
" -------------------------------------------------------------------------------------------- \n\n"
" MEG K7N HARDY 4-7-77-5 0483923 ALBERTA CROWN 571.7M \n"
" 106/04-07-077-05W4/02 S 572.4M W 278.3M BONNYVILLE 1600.0M \n"
" DEV (NC) HARDY MCMURRAY FM \n"
" HORIZONTAL RESUMPTIONPRODUCTION (SCHEME) CRUDE BITUMEN \n"
" MEG ENERGY CORP. 09-07-077-05W4 \n\n"
" SPL 11-24 HZ MARTEN 14-25-76-6 0494994 ALBERTA CROWN 705.3M \n"
" 100/14-25-076-06W5/00 S 566.0M E 800.6M ST. ALBERT 2700.0M \n"
" OUT (C) MARTEN CLEARWATER FM \n"
" HORIZONTAL NEW PRODUCTION CRUDE OIL \n"
" SPUR PETROLEUM LTD. 11-24-076-06W5 ")
matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    # Report the span and text of each full match
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capturing group, if any (the expression above defines none)
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
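The earlier suggestion of removing the header before applying the expression could be sketched roughly as follows. The helper name `strip_header` and the rule that a "dashed line" means ten or more hyphens are assumptions made for illustration; a real AER file may repeat the header on later pages, which this sketch does not handle.

```python
import re

# Assumption: a header rule is a line made only of hyphens (10 or more).
DASHES = re.compile(r"^\s*-{10,}\s*$")

def strip_header(text):
    """Return the text that follows the second dashed rule.

    If fewer than two dashed rules are found, the text is returned unchanged.
    """
    lines = text.splitlines()
    rule_indexes = [i for i, line in enumerate(lines) if DASHES.match(line)]
    if len(rule_indexes) < 2:
        return text
    return "\n".join(lines[rule_indexes[1] + 1:])

# Hypothetical, shortened sample used only to exercise the helper.
sample = (
    "DATE: 02 July 2019\n"
    "--------------------\n"
    "WELL NAME  LICENCE NUMBER\n"
    "--------------------\n"
    "MEG K7N HARDY 4-7-77-5  0483923\n"
)
body = strip_header(sample)
```

Here `body` holds only the permit lines, so the date and the column titles never reach the regular expression.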
The expression above is not anchored and causes a lot of backtracking. Anchoring it with
^[ \t]*
might make it somewhat more efficient:
^[ \t]*[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}
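One practical point about the anchored form: in Python, `^` matches at the start of every line only when `re.MULTILINE` is passed; without that flag it matches only at the start of the string. The sketch below illustrates the difference on a hypothetical two-permit excerpt. The column spacing in `sample` is assumed, since the excerpt in the question shows the whitespace collapsed, and `count_matches` is a helper name invented for this example.

```python
import re

PATTERN = r"^[ \t]*[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}"

# Hypothetical excerpt; the spacing between columns is an assumption.
sample = (
    "  MEG K7N HARDY 4-7-77-5      0483923  ALBERTA CROWN  571.7M\n"
    "  MEG ENERGY CORP.            09-07-077-05W4\n"
    "\n"
    "  SPL 11-24 HZ MARTEN 14-25-76-6  0494994  ALBERTA CROWN  705.3M\n"
    "  SPUR PETROLEUM LTD.         11-24-076-06W5\n"
)

def count_matches(text, flags=0):
    """Count non-overlapping matches of PATTERN in text."""
    return sum(1 for _ in re.finditer(PATTERN, text, flags))

anchored_per_line = count_matches(sample, re.MULTILINE)  # ^ matches at each line start
anchored_at_start_only = count_matches(sample)           # ^ matches only at position 0
```

With the flag, both permits in this sample are found; without it, only the one that begins the string is.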
Based on the current sample data, this might also be an option:
^[ \t]*[A-Z]+(?: [A-Z0-9-]+)+[ \t]+[0-9]{7}[ \t]+.*(?:\r?\n(?![ \t]*$).*)*
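As a rough sketch of how this last expression might feed the list the question asks for, the code below treats each match as one permit block and then splits every line of the block on runs of two or more spaces. Several things here are assumptions: the function name `parse_permits`, the column spacing in `sample`, and above all the idea that fields are always separated by at least two spaces. Adjacent columns that touch (such as RESUMPTIONPRODUCTION in the excerpt) or empty columns would need fixed-width slicing at known offsets instead.

```python
import re

# The block expression from above; it relies on a blank line between permits.
BLOCK = re.compile(
    r"^[ \t]*[A-Z]+(?: [A-Z0-9-]+)+[ \t]+[0-9]{7}[ \t]+.*(?:\r?\n(?![ \t]*$).*)*",
    re.MULTILINE,
)

def parse_permits(text):
    """Return one entry per permit; each entry is a list of rows of fields."""
    records = []
    for block in BLOCK.finditer(text):
        # Assumption: two or more spaces separate neighbouring fields.
        rows = [re.split(r"\s{2,}", line.strip())
                for line in block.group().splitlines()]
        records.append(rows)
    return records

# Hypothetical excerpt; the spacing between columns is an assumption.
sample = (
    "  MEG K7N HARDY 4-7-77-5      0483923  ALBERTA CROWN  571.7M\n"
    "  106/04-07-077-05W4/02       S 572.4M  W 278.3M  BONNYVILLE  1600.0M\n"
    "  MEG ENERGY CORP.            09-07-077-05W4\n"
    "\n"
    "  SPL 11-24 HZ MARTEN 14-25-76-6  0494994  ALBERTA CROWN  705.3M\n"
    "  SPUR PETROLEUM LTD.         11-24-076-06W5\n"
)
records = parse_permits(sample)
```

Each entry in `records` is then a list of rows, one per line of the permit, which can be mapped onto the header names before being written to a database.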