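I have to extract 3 strings from a file, as shown below:
I only need the 3 strings that come immediately before the keyword ">> For".
I wrote the following code to extract the list of strings, but it does not extract them correctly: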
https://pastebin.com/kRd0ecK3
Expected result for the file above:
import re
import sys
contents = "JLYLFPMKKLZDSRLBTEKH KMZMGQNLLMAETSMCUFLI KXKEOLJJKYCRQKASDJG J LYLFPMKKLZDSRLBTEKH K MZMGQNLLMAETSMCUFLI L KXKEOLJJKYCRQKASDJGJ LYLFPMKKLZDSRLBTEKHK MZMGQNLLMAETSMCUFLIL KXKEOLJJKYCRQKASDJGJ LYLFPMKKLZDSRLBTEKHK MZMGQNLLMAETSMCUFLIL >> For"
m = re.match(r'(.*)[A-Z]{20}\s{40}(.*)\s{20}>> For', contents)
if m:
    print(m.group(1))
Answer 0 (score: 1)
re.findall(r'(\w{20}\s+\w{20}\s+\w{20}\s+)>> For', contents)[0].split()
This should return what you want:
['KXKEOLJJKYCRQKASDJGJ', 'LYLFPMKKLZDSRLBTEKHK', 'MZMGQNLLMAETSMCUFLIL']
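One caveat with the one-liner: indexing `[0]` on the `findall` result raises `IndexError` when nothing matches, and `\w{20}` can also match the last 20 characters of a longer word. A minimal sketch of a more defensive variant follows. The helper name `last_three_before_marker` and the empty-list fallback are my own assumptions, not part of the answer above.

```python
import re

# \b anchors the first group at a word start, so only words of exactly
# 20 characters are accepted (the following \s+ closes each word).
PATTERN = re.compile(r'\b(\w{20})\s+(\w{20})\s+(\w{20})\s+>> For')

def last_three_before_marker(text):
    """Hypothetical helper: return the three 20-character words that
    directly precede '>> For', or an empty list if there is no match."""
    m = PATTERN.search(text)
    return list(m.groups()) if m else []

sample = "A" * 20 + " " + "B" * 20 + " " + "C" * 20 + " >> For"
print(last_three_before_marker(sample))
```

Returning an empty list instead of raising is a design choice; raise an exception there if a missing marker should be treated as an error.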
Answer 1 (score: 1)
You can use this regex,
([A-Z]{20})\s+([A-Z]{20})\s+([A-Z]{20})\s+>>\s*For
and capture group 1, group 2 and group 3.
Sample Python code,
import re
contents = 'JLYLFPMKKLZDSRLBTEKH KMZMGQNLLMAETSMCUFLI KXKEOLJJKYCRQKASDJG J LYLFPMKKLZDSRLBTEKH K MZMGQNLLMAETSMCUFLI L KXKEOLJJKYCRQKASDJGJ LYLFPMKKLZDSRLBTEKHK MZMGQNLLMAETSMCUFLIL KXKEOLJJKYCRQKASDJGJ LYLFPMKKLZDSRLBTEKHK MZMGQNLLMAETSMCUFLIL >> For'
m = re.match(r'.*([A-Z]{20})\s+([A-Z]{20})\s+([A-Z]{20})\s+>>\s*For', contents)
if m:
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
which prints
KXKEOLJJKYCRQKASDJGJ
LYLFPMKKLZDSRLBTEKHK
MZMGQNLLMAETSMCUFLIL
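Since the question mentions reading from a file, here is a rough sketch of the same regex applied to file contents. It uses `re.search` rather than `re.match` with a leading `.*`, because `.` does not match newlines by default, so the `.*` prefix would fail if the strings sit on separate lines after the first one. The function names `extract_groups` and `extract_from_file`, the `path` parameter, and the UTF-8 encoding are assumptions for illustration.

```python
import re

# Same pattern as above, without the leading '.*'.
PATTERN = re.compile(r'([A-Z]{20})\s+([A-Z]{20})\s+([A-Z]{20})\s+>>\s*For')

def extract_groups(text):
    """Return a tuple of the three captured strings, or None if no match."""
    m = PATTERN.search(text)  # \s+ also matches newlines between the strings
    return m.groups() if m else None

def extract_from_file(path):
    """Hypothetical wrapper: read the whole file and apply the pattern."""
    with open(path, encoding="utf-8") as fh:  # encoding is an assumption
        return extract_groups(fh.read())

multiline = "HEADER LINE\n" + "A" * 20 + "\n" + "B" * 20 + "\n" + "C" * 20 + "\n>> For"
print(extract_groups(multiline))
```

Note that `re.search` returns the first occurrence, whereas the greedy `.*` in the `re.match` version lands on the last one; with a single ">> For" marker the two agree.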
Answer 2 (score: 1)
A simple, dumb non-regex solution: use split with no separator, so it does not care about newlines, spaces, etc.
contents = "JLYLFPMKKLZDSRLBTEKH KMZMGQNLLMAETSMCUFLI KXKEOLJJKYCRQKASDJG J LYLFPMKKLZDSRLBTEKH K MZMGQNLLMAETSMCUFLI L KXKEOLJJKYCRQKASDJGJ LYLFPMKKLZDSRLBTEKHK MZMGQNLLMAETSMCUFLIL KXKEOLJJKYCRQKASDJGJ LYLFPMKKLZDSRLBTEKHK MZMGQNLLMAETSMCUFLIL >> For"
toks = contents.split()
for i in range(len(toks)-1):
    if toks[i]==">>" and toks[i+1]=="For":
        print(toks[i-3:i])
        break
Prints:
['KXKEOLJJKYCRQKASDJGJ', 'LYLFPMKKLZDSRLBTEKHK', 'MZMGQNLLMAETSMCUFLIL']
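If you want to reuse the split approach, it can be wrapped in a function. One thing to watch: when fewer than three tokens precede the marker, `i-3` becomes negative and the slice wraps around, so it can silently return the wrong tokens or nothing. The sketch below clamps the start index. The name `tokens_before_marker` and the `marker` and `count` parameters are my own, not from the answer above.

```python
def tokens_before_marker(text, marker=(">>", "For"), count=3):
    """Hypothetical helper: return up to `count` whitespace-separated
    tokens that directly precede the marker tokens, or [] if absent."""
    toks = text.split()  # no separator: splits on any run of whitespace
    width = len(marker)
    for i in range(len(toks) - width + 1):
        if tuple(toks[i:i + width]) == tuple(marker):
            # Clamp at 0 so a negative start cannot wrap around the list.
            return toks[max(0, i - count):i]
    return []

print(tokens_before_marker("a b c d >> For x"))
```

Unlike the regex answers, this does not check that each token is 20 characters long; it only relies on position relative to the marker.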