I am trying to extract text from a structured string using a regular expression. For example:
string = "field1:afield3:bfield2:cfield3:d"
All I want are the values of field3, 'b' and 'd'.
I tried using
regex = "(field1:.*?)?(field2:.*?)?field3:"
and splitting the original string with it.
But I got this:
['', 'field1:a', None, 'b', None, 'field2:c', 'd']
So what is the solution?
The real case is (a one-line string, with no \n):
string = "1st sentence---------------------- Forwarded by Michelle
Cash/HOU/ECT on ---------------------------Ava Syon@ENRON To: Michelle
Cash/HOU/ECT@ECTcc: Twanda Sweet/HOU/ECT@ECT Subject: 2nd sentence---------
------------- Forwarded by Michelle Cash/HOU/ECT on -----------------------
----Ava Syon@ENRON To: Michelle Cash/HOU/ECT@ECTcc: Twanda
Sweet/HOU/ECT@ECT Subject: 3rd sentence"
The desired result is a list.
Thanks!
Answer 0: (score: 1)
You can use a positive lookbehind. It matches any single character directly after field3:
>>> import re
>>> string = "field1:afield3:bfield2:cfield3:d"
>>> re.findall(r'(?<=field3:).', string)
['b', 'd']
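This only works for single-character values. I would add a positive lookahead, but then it would become the same answer as Wiktor's.
So here is an alternative using re.split():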
>>> string = "field1:afield3:boatfield2:cfield3:dolphin"
>>> elements = re.split(r'(field\d+:)',string)
>>> [elements[i+1] for i, x in enumerate(elements) if x == 'field3:']
['boat', 'dolphin']
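The re.split() idea above can also be wrapped in a small helper. This is only a sketch: the name values_for is hypothetical, and it assumes every label has the form field<digits>: and that the string starts with a label.

```python
import re

def values_for(label, text):
    """Return the values that follow each occurrence of `label` (e.g. 'field3:')."""
    # The capturing group keeps the labels in the split result:
    # ['', 'field1:', 'a', 'field3:', 'boat', ...]
    parts = re.split(r'(field\d+:)', text)
    labels = parts[1::2]   # odd indices hold the labels
    values = parts[2::2]   # the item after each label is its value
    return [value for lab, value in zip(labels, values) if lab == label]

string = "field1:afield3:boatfield2:cfield3:dolphin"
print(values_for('field3:', string))  # ['boat', 'dolphin']
```

Pairing labels with values through slicing avoids indexing elements[i+1] by hand, at the cost of relying on the alternating layout that re.split() produces when the pattern has exactly one capturing group.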
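Answer 1: (score: 1)
Use
re.findall(r'field3:(.*?)(?=field\d+:|$)', s)
See the regex demo. Note: when the pattern contains a capturing group, re.findall returns the contents of that group, so you do not need a lookbehind in the pattern; the capturing group does that job.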
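To illustrate the note about re.findall and capturing groups, here is a short comparison on the sample string from the question. The variable names are made up for this sketch.

```python
import re

s = "field1:afield3:bfield2:cfield3:d"

# No group: findall returns the whole match, label included.
whole_match = re.findall(r'field3:.*?(?=field\d+:|$)', s)

# One capturing group: findall returns only the group's text.
with_group = re.findall(r'field3:(.*?)(?=field\d+:|$)', s)

# No group, but a lookbehind keeps 'field3:' out of the match itself.
with_lookbehind = re.findall(r'(?<=field3:).*?(?=field\d+:|$)', s)

print(whole_match)      # ['field3:b', 'field3:d']
print(with_group)       # ['b', 'd']
print(with_lookbehind)  # ['b', 'd']
```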
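The regex matches:
field3: - a literal character sequence
(.*?) - capturing group 1: any 0+ characters other than a newline, as few as possible (with the re.DOTALL modifier, the dot also matches newlines)
(?=field\d+:|$) - a positive lookahead that requires (but does not consume or add to the match or capture) either field, 1+ digits and a : or the end of the string right after the current position.
import re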
rx = r"field3:(.*?)(?=field\d+:|$)"
s = "field1:afield3:b and morefield2:cfield3:d and here"
res = re.findall(rx, s)
print(res)
# => ['b and more', 'd and here']
Note: a more efficient (unrolled) version of the same regex is
field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)
See the regex demo.
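A quick way to gain some confidence that the two patterns agree is to run both over a few inputs. The samples below are invented for this sketch, and it makes no claim about speed. One difference to keep in mind: [^f] also matches newlines, so on multi-line input the unrolled pattern behaves like the lazy one with re.DOTALL.

```python
import re

lazy = re.compile(r'field3:(.*?)(?=field\d+:|$)')
unrolled = re.compile(r'field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)')

samples = [
    "field1:afield3:b and morefield2:cfield3:d and here",
    "field3:fine stufffield2:x",  # the value itself contains the letter 'f'
    "field1:only",                # no field3 at all
]

# Collect the results of both patterns for each single-line sample.
results = [(lazy.findall(s), unrolled.findall(s)) for s in samples]
for lazy_result, unrolled_result in results:
    print(lazy_result, unrolled_result)
```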
Answer 2: (score: 1)
A more involved solution that gets field values by field number, using the built-in str.replace(), str.split() and str.startswith() functions:
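def getFieldValues(s, field_number):
    # Insert a '|' delimiter before each field (assumes values contain neither '|' nor 'field')
    delimited = s.replace('field', '|field')
    # Include the ':' in the prefix so that field 3 does not also match field30, etc.
    prefix = 'field' + str(field_number) + ':'
    # Split on the first ':' only, so values that contain ':' are kept whole
    return [i.split(':', 1)[1] for i in delimited.split('|') if i.startswith(prefix)]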
s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"
print(getFieldValues(s, 3))
# ['b some text', 'd and data']
print(getFieldValues(s, 1))
# ['a hello-again']
print(getFieldValues(s, 2))
# ['c another text']
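If the values of several fields are needed at once, one option is to collect them all in a single pass with a regex that has two capturing groups. This is a sketch rather than part of the answer above: group_fields is a hypothetical name, and it assumes values never contain text of the form field<digits>:.

```python
import re
from collections import defaultdict

def group_fields(s):
    """Map each field number to the list of its values, in order of appearance."""
    grouped = defaultdict(list)
    # With two capturing groups, findall yields (number, value) tuples.
    for number, value in re.findall(r'field(\d+):(.*?)(?=field\d+:|$)', s):
        grouped[int(number)].append(value)
    return dict(grouped)

s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"
fields = group_fields(s)
print(fields[3])  # ['b some text', 'd and data']
print(fields[1])  # ['a hello-again']
```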