How do I extract the STOP_DATE value from this long string in Python?
GROUP = TEMPORALINFORMATION
OBJECT = PRODUCTIONDATETIME
NUM_VAL = 1
VALUE = "2015-07-19T18:29:43Z"
END_OBJECT = PRODUCTIONDATETIME
OBJECT = START_DATE
NUM_VAL = 1
VALUE = "2015-07-11T20:17:22Z"
END_OBJECT = START_DATE
OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE
END_GROUP = TEMPORALINFORMATION
Answer 0 (score: 1)
As others have shown, you can do this with a one-line regular expression, but this version is clearer:
import re
input_data=""" GROUP = TEMPORALINFORMATION\n\n OBJECT = PRODUCTIONDATETIME\n NUM_VAL = 1\n VALUE = "2015-07-19T18:29:43Z"\n END_OBJECT = PRODUCTIONDATETIME\n\n OBJECT = START_DATE\n NUM_VAL = 1\n VALUE = "2015-07-11T20:17:22Z"\n END_OBJECT = START_DATE\n\n OBJECT = STOP_DATE\n NUM_VAL = 1\n VALUE = "2015-07-11T21:03:52Z"\n END_OBJECT = STOP_DATE\n\n END_GROUP = TEMPORALINFORMATION
"""
def find_stop_date(s):
    in_stop_date = False
    result = None
    for line in s.split("\n"):
        line = line.strip()
        # Entering the STOP_DATE object block
        if re.search(r"^OBJECT.*=.*STOP_DATE", line):
            in_stop_date = True
        # Leaving the STOP_DATE object block
        if re.search(r"^END_OBJECT.*=.*STOP_DATE", line):
            in_stop_date = False
        if in_stop_date:
            # The captured value keeps its surrounding double quotes
            re_result = re.search(r"VALUE\s*=\s*(.*)", line)
            if re_result:
                result = re_result.group(1)
    return result

result = find_stop_date(input_data)
if result:
    print("Found: {}".format(result))
else:
    print("not found")
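The same line-by-line idea can be generalized to collect every OBJECT block at once. The following is only a minimal sketch and not part of the original answer: the names parse_objects and sample are hypothetical, and it assumes each OBJECT block contains exactly one double-quoted VALUE line.

```python
import re

sample = '''GROUP = TEMPORALINFORMATION
OBJECT = PRODUCTIONDATETIME
NUM_VAL = 1
VALUE = "2015-07-19T18:29:43Z"
END_OBJECT = PRODUCTIONDATETIME
OBJECT = START_DATE
NUM_VAL = 1
VALUE = "2015-07-11T20:17:22Z"
END_OBJECT = START_DATE
OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE
END_GROUP = TEMPORALINFORMATION'''


def parse_objects(text):
    """Return {object_name: value} for every OBJECT ... END_OBJECT block."""
    values = {}
    current = None  # name of the OBJECT block we are currently inside, if any
    for line in text.splitlines():
        line = line.strip()
        start = re.match(r"OBJECT\s*=\s*(\S+)", line)
        if start:
            current = start.group(1)
            continue
        if line.startswith("END_OBJECT"):
            current = None
            continue
        value = re.match(r'VALUE\s*=\s*"([^"]*)"', line)
        if value and current is not None:
            # Store the value without its surrounding double quotes
            values[current] = value.group(1)
    return values


objects = parse_objects(sample)
print(objects.get("STOP_DATE"))
```

Using objects.get("STOP_DATE") returns None rather than raising an error if the block is missing.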
Answer 1 (score: 1)
You can use this regular expression:
STOP_DATE.+?VALUE\s*=\s*\"(.+?)\"
Python code:
import re
regex = r"STOP_DATE.+?VALUE\s*=\s*\"(.+?)\""
match = re.search(regex, test_str, re.DOTALL)
print(match.group(1))
where test_str is the name of the variable holding the string.
Result:
2015-07-11T21:03:52Z
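Note that re.search returns None when nothing matches, so calling match.group(1) directly raises AttributeError on input without a STOP_DATE block. A minimal, self-contained sketch of a guarded version follows; the helper name extract_stop_date and the shortened sample string are assumptions for illustration, not part of the original answer.

```python
import re

# Shortened sample data, assumed for illustration
test_str = '''OBJECT = START_DATE
NUM_VAL = 1
VALUE = "2015-07-11T20:17:22Z"
END_OBJECT = START_DATE
OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE'''

regex = r"STOP_DATE.+?VALUE\s*=\s*\"(.+?)\""


def extract_stop_date(text):
    """Return the STOP_DATE value, or None if the pattern does not match."""
    # re.DOTALL lets "." match newlines, so ".+?" can span several lines
    match = re.search(regex, text, re.DOTALL)
    return match.group(1) if match else None


print(extract_stop_date(test_str))
print(extract_stop_date("no dates here"))
```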
Answer 2 (score: 0)
Sven's answer is not optimal; my pattern runs about 5 times faster, and the DOTALL
flag can be omitted: STOP_DATE[^"]+"([^"]+)
import re
test_str = '''GROUP = TEMPORALINFORMATION
OBJECT = PRODUCTIONDATETIME
NUM_VAL = 1
VALUE = "2015-07-19T18:29:43Z"
END_OBJECT = PRODUCTIONDATETIME
OBJECT = START_DATE
NUM_VAL = 1
VALUE = "2015-07-11T20:17:22Z"
END_OBJECT = START_DATE
OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE
END_GROUP = TEMPORALINFORMATION'''
print(re.search(r'STOP_DATE[^"]+"([^"]+)', test_str).group(1))
# 2015-07-11T21:03:52Z
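The performance gain comes from using two greedy "negated character classes" instead of the dot.
Because the desired substring is the only double-quoted value that follows STOP_DATE, the double quote is the only character that needs to be identified.
If your actual data contains other double-quoted values and you need to name VALUE explicitly, you could use: STOP_DATE[^"]+VALUE[^"]+"([^"]+)
but the number of steps required swells to 2.5 times that of my previous pattern (though it is still 2 times faster than Sven's).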
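To check that the three patterns discussed in this thread agree, and to get a rough feel for their relative cost, something like the following sketch could be used. The dictionary labels and the iteration count are arbitrary choices made here for illustration. Step counts from a regex debugger do not necessarily translate directly into wall-clock time in Python's re module, so the timings are printed without any expected values.

```python
import re
import timeit

test_str = '''GROUP = TEMPORALINFORMATION
OBJECT = PRODUCTIONDATETIME
NUM_VAL = 1
VALUE = "2015-07-19T18:29:43Z"
END_OBJECT = PRODUCTIONDATETIME
OBJECT = START_DATE
NUM_VAL = 1
VALUE = "2015-07-11T20:17:22Z"
END_OBJECT = START_DATE
OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE
END_GROUP = TEMPORALINFORMATION'''

# The three patterns from the answers above, precompiled
patterns = {
    "lazy dot + DOTALL": re.compile(r'STOP_DATE.+?VALUE\s*=\s*"(.+?)"', re.DOTALL),
    "negated class": re.compile(r'STOP_DATE[^"]+"([^"]+)'),
    "negated class + VALUE": re.compile(r'STOP_DATE[^"]+VALUE[^"]+"([^"]+)'),
}

# Every pattern should capture the same timestamp
results = {name: p.search(test_str).group(1) for name, p in patterns.items()}

for name, p in patterns.items():
    # Time 10000 searches; absolute numbers depend on machine and Python version
    seconds = timeit.timeit(lambda: p.search(test_str), number=10000)
    print("{:<24} {}  {:.4f}s".format(name, results[name], seconds))
```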