I am relatively new to regular expressions (and for some reason have always struggled with them)...
I have text of this form:
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
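Parsing the text reveals the following structure:
My question is:
How can I use this knowledge (and regular expressions) to write a function that parses similar text and returns the variables of interest (items 1-5 listed above)?
Pseudocode for the function I want to write: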
def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...
    pass
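When the function is passed text that conforms to the structure I identified above, how can I use regular expressions to implement it so that it returns the variables of interest?
[[Edit]]
For some reason I seem to have been struggling with regular expressions for quite a while. If I am going to learn from the correct answer here, it would be much better if it came with an explanation of why the magic expression (sorry, regexpr) actually works.
I want to really learn this stuff rather than copy and paste expressions...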
Answer 0 (score: 2)
You can use the following regex:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
Python:
import re
financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""
print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)', financialData))
Output:
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
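As a minimal sketch of how this pattern could be wrapped in the function the question asks for, the snippet below uses `re.search` on a single line of text. The module-level name `PATTERN` and the choice to return `None` when nothing matches are assumptions made for illustration; they are not part of the original answer.

```python
import re

# The answer's pattern, split over two raw strings for readability.
PATTERN = re.compile(
    r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)'
    r'\s*([\d,]*).*price of\s*(\d*.\d+?p)'
)

def grok_directors_dealings_text(text_input):
    """Return (name, title, transaction_type, lot_size, price), or None if no match."""
    match = PATTERN.search(text_input)
    if match is None:
        return None  # assumption: signal "no match" with None
    return match.groups()

line = ("Mark Brookes, Non Executive Director bought 811 shares in the company "
        "on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares")
print(grok_directors_dealings_text(line))
print(grok_directors_dealings_text("no dealings here"))
```

All five values come back as strings; the price keeps its trailing `p` because the last group includes it.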
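Edit 1
To understand what each part means, follow the DEMO link; on the right-hand side you will find a panel explaining what each character means:
Debuggex likewise helps you step through the string by showing which group matches which characters!
Here is a debuggex demo for your specific case: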
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
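For readers without access to the demo links, one way to see what each piece does is to write the same pattern with `re.VERBOSE`, which ignores unescaped whitespace and allows `#` comments inside the pattern. This is only an annotated sketch of the expression above: the literal spaces are escaped as `\ ` so that verbose mode keeps them, and the `sample` string is taken from the question.

```python
import re

pattern = re.compile(r'''
    (.*?)           # group 1, name: as few characters as possible ...
    ,\s             # ... up to the first comma followed by whitespace
    (.*)            # group 2, title: greedy, then backtracks until the rest fits
    \s
    (sold(?:\ post-exercise)?|bought|exercised)  # group 3: transaction type
    \s*
    ([\d,]*)        # group 4, lot size: digits and commas
    .*price\ of\s*  # skip ahead to the literal text "price of"
    (\d*.\d+?p)     # group 5, price; the unescaped . matches any character
''', re.VERBOSE)

sample = ("David Meredith, Financial Director sold post-exercise 15,000 shares "
          "in the company on YYYY-mm-dd at a price of 1044.00p.")
print(pattern.search(sample).groups())
```

Note that the `.` in the last group is not escaped, so it accepts any character between the digits, not only a decimal point; writing `\.` there would be stricter.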
Answer 1 (score: 1)
I came up with this regex:
([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p
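Basically, we use parentheses to capture the important pieces of information you want, so let's go through them:
([\w ]+): \w matches any word character (letters, digits and underscore; [a-zA-Z0-9_] for ASCII text). Together with the space in the class and the +, this matches one or more word characters or spaces, which gives us the person's name;
([\w ]+): another one, after the comma and a space, to get the title;
(sold post-exercise|sold|bought|exercised): then we search for our transaction type. Note that I put sold post-exercise before sold so that it tries to match the longer term first;
([\d,\.]+): then we try to find the number, which is made of digits (\d), a comma, and possibly a dot;
([\d\.,]+): then we need to get the price, which is basically the same as the lot size.
The pieces of the regex that connect the groups are also very basic.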
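The note about the order of alternatives can be checked with a small experiment. The sketch below compares both orders on a fragment of the sample text; the variable names are illustrative only.

```python
import re

text = "Financial Director sold post-exercise 15,000 shares"

# Alternatives are tried left to right; the first one that succeeds wins.
shorter_first = re.search(r'(sold|sold post-exercise)', text)
longer_first = re.search(r'(sold post-exercise|sold)', text)

print(shorter_first.group(1))  # sold
print(longer_first.group(1))   # sold post-exercise
```

In the full expression the space and number required after the group would force the engine to backtrack and try the longer alternative anyway, so the order matters most when the group stands alone. Listing the longest alternative first is still the safer habit.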
If you try it on regex101, it gives some explanation of the regex and generates Python code like this for you to use:
import re
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')
test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."
print(p.findall(test_str))
Answer 2 (score: 0)
Here is a regex that works:
(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)
You use it like this:
import re
def get_data(line):
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)"
    m = re.match(pattern, line)
    return m.groups()
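For the first line it will return
('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')
(the title keeps the spaces around it, so you may want to call .strip() on it).
Edit: added explanation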
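This regex works as follows.
The first part, (.*?), means: take the string up to the next thing to match (which is the ,).
. means any character
* means it can occur many times (many characters, not just one)
? means don't be greedy: it stops at the first ',' rather than a later one (if there are several ',')
After that there is another (.*?), which again takes characters up to the next thing to match (this time the constant words).
After that there is (sold post-exercise|sold|bought|exercised), which means: find one of these words (separated by |).
After that there is a .*?, which again means take all the text up to the next match (this time it is not surrounded by (), so it is not captured as a group and will not be part of the output).
([\d|,]+) means take a digit (\d) or a comma; + means one or more times. Inside square brackets | is a literal character rather than "or", so [\d,]+ would be enough.
Then .*? again as before, followed by 'price of ', which matches that literal string.
Finally ([\d|\.]+) again means take a digit or a dot (escaped, because an unescaped . is the regex symbol for 'any character') one or more times.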
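Two details from the explanation above can be tried out directly: the difference between a lazy and a greedy quantifier, and what `|` does inside square brackets. The input strings below are made up for illustration and do not come from the question.

```python
import re

# Hypothetical input with two commas, to show where each quantifier stops.
text = "Smith, John, CEO bought 10 shares"

lazy = re.match(r'(.*?),', text).group(1)    # stops at the first comma
greedy = re.match(r'(.*),', text).group(1)   # runs on to the last comma
print(lazy)    # Smith
print(greedy)  # Smith, John

# Inside [...] the | is an ordinary character, so it gets matched too.
with_pipe = re.findall(r'[\d|,]+', "1,000|2")
without_pipe = re.findall(r'[\d,]+', "1,000|2")
print(with_pipe)     # ['1,000|2']
print(without_pipe)  # ['1,000', '2']
```

For the sample text in the question the extra `|` is harmless, because no pipe character appears next to the numbers.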
Answer 3 (score: 0)
You can use the following regex, which simply looks for the characters around the delimiters:
(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p
The parts in parentheses will be captured as groups.
>>> import re
>>> l = ['''David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...''', '''Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...''', '''Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...''']
>>> for s in l:
...     print(re.findall(r'(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p', s))
...
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00')]
[('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')]
[('Albert Ellis', 'CEO', 'bought', '262', '52.00')]
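If named fields are more convenient than positional tuples, the same expression can be written with named groups (`(?P<name>...)`) and read back with `groupdict()`. The sketch below is one possible arrangement, not part of the original answer: the function name `parse_dealing`, the group names, and the conversion of the lot size and price to numbers are all assumptions.

```python
import re

# Same pattern as above, with each group given a name.
PATTERN = re.compile(
    r'(?P<name>.*?), (?P<title>.*?) '
    r'(?P<transaction_type>sold post-exercise|bought|exercised|sold) '
    r'(?P<lot_size>.*?) shares .*? price of (?P<price>.*?)p'
)

def parse_dealing(text):
    """Return a dict of the five fields, or None if the text does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: lot sizes use commas as thousands separators, prices are decimals.
    fields['lot_size'] = int(fields['lot_size'].replace(',', ''))
    fields['price'] = float(fields['price'])
    return fields

sample = ("David Meredith, Financial Director sold post-exercise 15,000 shares "
          "in the company on YYYY-mm-dd at a price of 1044.00p.")
result = parse_dealing(sample)
print(result)
```

Because the lot size and price groups use `.*?`, text that departs from the expected layout could capture something that is not a number, in which case `int()` or `float()` raises `ValueError`.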