How do I extract specific fields from a structured string?

Time: 2017-03-18 16:39:52

Tags: python regex

I am trying to extract text from a structured string with a regular expression. For example:

string = "field1:afield3:bfield2:cfield3:d"

All I want are the values of field3, which are 'b' and 'd'.

I tried using

regex = "(field1:.*?)?(field2:.*?)?field3:"

and splitting the original string by it.

But I got this:

['', 'field1:a', None, 'b', None, 'field2:c', 'd']

So, what is the solution?

The real case is:

string = "1st sentence---------------------- Forwarded by Michelle 
Cash/HOU/ECT on ---------------------------Ava Syon@ENRON To: Michelle 
Cash/HOU/ECT@ECTcc: Twanda Sweet/HOU/ECT@ECT Subject: 2nd sentence---------
------------- Forwarded by Michelle Cash/HOU/ECT on -----------------------
----Ava Syon@ENRON To: Michelle Cash/HOU/ECT@ECTcc: Twanda 
Sweet/HOU/ECT@ECT Subject: 3rd sentence"

(a one-line string, with no \n)

A list of the three sentences is the result I need.

Thanks!

3 answers:

Answer 0: (score: 1)

You can use a positive lookbehind. It matches any single character directly after field3: :

>>> import re
>>> string = "field1:afield3:bfield2:cfield3:d"
>>> re.findall(r'(?<=field3:).', string)
['b', 'd']

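This only works for single-character values. I would add a positive lookahead, but then it would become the same answer as Wiktor's.

So here is an alternative using re.split():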

>>> string = "field1:afield3:boatfield2:cfield3:dolphin"
>>> elements = re.split(r'(field\d+:)',string)
>>> [elements[i+1] for i, x in enumerate(elements) if x == 'field3:']
['boat', 'dolphin']
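
The re.split() approach works because a capturing group in the pattern keeps the delimiters in the result, so field names and values alternate in the list. As a minimal sketch of that idea, the helper below collects the values of every field at once. The name group_fields is hypothetical (it is not part of the answer above), and the sketch assumes field names always look like field<digits>: and never occur inside a value.

```python
import re
from collections import defaultdict

def group_fields(s):
    """Map each field name to the list of its values (hypothetical helper)."""
    # The capturing group keeps the 'fieldN:' delimiters in the result.
    parts = re.split(r'(field\d+:)', s)
    grouped = defaultdict(list)
    # parts[0] is the text before the first field name; after it,
    # names sit at odd indexes and their values at even indexes.
    for name, value in zip(parts[1::2], parts[2::2]):
        grouped[name.rstrip(':')].append(value)
    return dict(grouped)

fields = group_fields("field1:afield3:boatfield2:cfield3:dolphin")
print(fields['field3'])
# ['boat', 'dolphin']
```

Looking up fields['field3'] then gives the same result as the list comprehension above, without scanning the split list once per field.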

Answer 1: (score: 1)

Use

re.findall(r'field3:(.*?)(?=field\d+:|$)', s)

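See the regex demo. Note: re.findall returns the contents of the capturing group, so you do not need a lookbehind in the pattern; the capturing group does the job.

The regex matches:

  • field3: - a literal character sequence
  • (.*?) - any 0+ characters other than line breaks, as few as possible (with the re.DOTALL modifier, the dot also matches line breaks)
  • (?=field\d+:|$) - a positive lookahead that requires (but does not consume, so nothing is added to the match or the capture) either field, 1+ digits and :, or the end of the string, immediately after the current position.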

Python demo

import re
rx = r"field3:(.*?)(?=field\d+:|$)"
s = "field1:afield3:b and morefield2:cfield3:d and here"
res = re.findall(rx, s)
print(res)
# => ['b and more', 'd and here']
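
The remark about re.DOTALL can be illustrated with a short sketch. The input string here is made up for the illustration (the question states the real data has no \n): when a value contains a line break, the dot cannot cross it without re.DOTALL, so that field is not matched at all.

```python
import re

rx = r"field3:(.*?)(?=field\d+:|$)"
# Made-up input: the first field3 value contains a line break.
s = "field1:afield3:b\nmorefield2:cfield3:d"

# Without re.DOTALL the dot stops at '\n', the lookahead never succeeds
# for the first field3, and only the last value is found.
without_dotall = re.findall(rx, s)
print(without_dotall)
# ['d']

# With re.DOTALL the dot also matches '\n', so both values are captured.
with_dotall = re.findall(rx, s, re.DOTALL)
print(with_dotall)
# ['b\nmore', 'd']
```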

Note: a more efficient (unrolled) version of the same regex is

field3:([^f]*(?:f(?!ield\d+:)[^f]*)*)

See the regex demo.
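
The same capture-plus-lookahead idea might also be applied to the real case from the question. This is only a sketch under assumptions that the question does not confirm: each wanted sentence either starts the string or follows "Subject: ", and it ends at a run of five or more dashes followed by " Forwarded by", or at the end of the string. The pattern below is written for this one sample, not taken from the answer.

```python
import re

# The sample from the question, joined into one line (no '\n').
string = (
    "1st sentence---------------------- Forwarded by Michelle "
    "Cash/HOU/ECT on ---------------------------Ava Syon@ENRON To: Michelle "
    "Cash/HOU/ECT@ECTcc: Twanda Sweet/HOU/ECT@ECT Subject: 2nd sentence---------"
    "------------- Forwarded by Michelle Cash/HOU/ECT on -----------------------"
    "----Ava Syon@ENRON To: Michelle Cash/HOU/ECT@ECTcc: Twanda "
    "Sweet/HOU/ECT@ECT Subject: 3rd sentence"
)

# Assumed structure: a sentence starts at the beginning of the string or
# after 'Subject: ' and runs up to the next forwarding banner or the end.
rx = r"(?:^|Subject: )(.*?)(?=-{5,} Forwarded by|$)"

sentences = re.findall(rx, string)
print(sentences)
# ['1st sentence', '2nd sentence', '3rd sentence']
```

If the real messages use a different banner or header layout, the two anchors in the pattern would need to be adjusted accordingly.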

Answer 2: (score: 1)

A "complex" solution that uses the built-in str.replace(), str.split() and str.startswith() functions to get field values by field number:

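def getFieldValues(s, field_number):
    delimited = s.replace('field', '|field')  # insert a delimiter before each field name
    prefix = 'field' + str(field_number) + ':'  # include ':' so that field3 does not also match field30
    # split on the first ':' only, so values that contain ':' are kept whole
    return [i.split(':', 1)[1] for i in delimited.split('|') if i.startswith(prefix)]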

s = "field1:a hello-againfield3:b some textfield2:c another textfield3:d and data"

print(getFieldValues(s, 3))
# ['b some text', 'd and data']

print(getFieldValues(s, 1))
# ['a hello-again']

print(getFieldValues(s, 2))
# ['c another text']