Question

I'm relatively new to regular expressions (and for some reason have always struggled with them)...

I have text of this form:

David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...

Mark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...

Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...

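Parsing the text reveals the following structure:

1. The two or more words that begin the sentence, preceding the first comma, are the name of the person involved in the transaction.
2. The one or more words preceding ('sold' | 'bought' | 'exercised' | 'sold post-exercise') are the person's title.
3. The presence of any one of ('sold' | 'bought' | 'exercised' | 'sold post-exercise') after the title identifies the transaction type.
4. The first numeric string following the transaction type ('sold' | 'bought' | 'exercised' | 'sold post-exercise') denotes the size of the transaction.
5. 'price of' precedes a numeric string, which specifies the price at which the deal was struck.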

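My question is:

How can I use this knowledge (and regex) to write a function that parses similar text and returns the variables of interest (items 1-5 listed above)?

Pseudocode for the function I want to write: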

def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...

    pass

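How would I use regex to implement the function so that it returns the variables of interest when it is passed text that conforms to the structure identified above?

[[Edit]]

For some reason I have been struggling with regex for a while. If I am to learn from the correct answer here, it would be much better if it came with an explanation of WHY the magical expression (sorry, regexpr) actually works.

I want to actually learn this stuff rather than copy-paste expressions...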

Answer 1

You can use the following regex:

(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)

DEMO

Python:

import re

financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...

Mark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...

Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""

print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)', financialData))

Output:

[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
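To tie this back to the function in the question, the pattern could be wrapped roughly as follows. This is a minimal sketch, assuming one announcement per call. Two things here are my own assumptions rather than part of the answer: the dot in the price is escaped (`\.`) so that it only matches a literal '.', and a tuple of `None` values is returned when nothing matches.

```python
import re

# Same structure as the pattern above, except that the price dot is escaped
# (an adjustment for illustration, so it only matches a literal '.').
PATTERN = re.compile(
    r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)'
    r'\s*([\d,]*).*price of\s*(\d*\.\d+p)'
)

def grok_directors_dealings_text(text_input):
    """Return (name, title, transaction_type, lot_size, price) as strings."""
    m = PATTERN.search(text_input)
    if m is None:
        # Assumption: signal "no match" with a tuple of Nones
        return (None, None, None, None, None)
    return m.groups()

sample = ("Mark Brookes, Non Executive Director bought 811 shares in the "
          "company on  YYYY-mm-dd at a price of 76.75p. The Director now "
          "holds 189,952 shares representing 0.38% of the shares in...")
print(grok_directors_dealings_text(sample))
# ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p')
```

`search` is used instead of `findall` because the function handles a single announcement; for a block of several announcements, `PATTERN.findall(text)` returns one tuple per match.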

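Edit 1

To understand what each part means, follow the DEMO link; on the right-hand side there is a panel that explains what each character in the expression does.

Debuggex also helps: it shows which group matches which characters, so you can step through your string!

Here is a Debuggex demo for your specific case: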

(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)

Regular expression visualization

Debuggex Demo

Answer 2

I came up with this regex:

([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p

Regular expression visualization

Debuggex Demo

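Basically, we use parentheses to capture the pieces of information you want, so let's go through them:

([\w ]+): \w matches any word character (letters, digits and underscore); together with the space, one or more times, this gives us the person's name;
([\w ]+): another one after the comma and space, to get the title;
(sold post-exercise|sold|bought|exercised): then we look for the transaction type. Note that I put sold post-exercise before sold, so that it tries to match the longer phrase first;
([\d,\.]+): then we try to find a number made up of digits (\d), commas, and possibly a dot;
([\d\.,]+): then we need the price, which is basically the same as the transaction size.

The pieces of regex that connect the groups are also very basic.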
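The point about alternation order can be checked in isolation. The snippet below is a small sketch (the variable names are only illustrative) showing that Python tries alternatives from left to right and takes the first one that succeeds:

```python
import re

text = "Financial Director sold post-exercise 15,000 shares"

# Longer alternative first: the whole phrase is captured.
long_first = re.search(r'(sold post-exercise|sold|bought|exercised)', text)
# Shorter alternative first: 'sold' succeeds and the longer one is never tried.
short_first = re.search(r'(sold|sold post-exercise|bought|exercised)', text)

print(long_first.group(1))   # sold post-exercise
print(short_first.group(1))  # sold
```

Inside the full pattern, the ` ([\d,\.]+)` that follows would force the engine to backtrack to the longer alternative anyway, but listing the longer phrase first avoids relying on that.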

If you try it on regex101, it gives some explanation of the regex and generates this Python code for you to use:

import re
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')

test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."

print(re.findall(p, test_str))

Answer 3

This is the correct regex:

(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)

You use it like this:

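import re

def get_data(line):
    # Groups: name, title, transaction type, lot size, price
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)"
    m = re.match(pattern, line)
    # re.match returns None when the line does not fit the pattern
    return m.groups() if m else None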

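For the first line, this returns

('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')

(the title keeps its surrounding spaces, because the pattern does not consume the whitespace around it).

Edit: added an explanation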

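The regex works as follows. It starts with (.*?), which means: take the string up to the next thing to match (which is the ,)

. means any character

* means it can occur many times (many characters, not just one)

? means don't be greedy, so it stops at the first ',' rather than a later one (if there are several ',')

After that there is another (.*?), which again takes characters up to the next thing to match (here, one of the constant words)

After that there is (sold post-exercise|sold|bought|exercised), which means: find one of these words (separated by |)

After that there is a .*?, which again means take all the text up to the next match (this time it is not surrounded by (), so it is not captured as a group and will not be part of the output)

([\d|,]+) means take a digit (\d) or a comma; + stands for one or more times (inside [...] the | is a literal character rather than "or", so [\d,]+ would be enough)

Then .*? again, as before

'price of ' matches the actual string 'price of '

And finally ([\d|\.]+) means again take a digit or a dot (escaped, because in regex a dot means 'any character') one or more times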
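The difference between the lazy `.*?` and the greedy `.*`, and the behaviour of `|` inside a character class, can be seen on a small example. This is only an illustrative sketch; the sample line is shortened from the question's text and the variable names are made up:

```python
import re

line = ("Albert Ellis, CEO bought 262 shares. "
        "The Director now holds 465,085 shares.")

lazy = re.match(r'(.*?),', line).group(1)    # stops at the first comma
greedy = re.match(r'(.*),', line).group(1)   # runs on to the last comma
print(lazy)    # Albert Ellis
print(greedy)  # Albert Ellis, CEO bought 262 shares. The Director now holds 465

# Inside [...] the '|' is a literal character, not "or"
with_pipe = re.findall(r'[\d|,]+', "1,000|2")
without_pipe = re.findall(r'[\d,]+', "1,000|2")
print(with_pipe)     # ['1,000|2']
print(without_pipe)  # ['1,000', '2']
```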

Answer 4

You can use the following regex, which simply looks for the characters around the delimiters:

(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p

The parts in parentheses will be captured as groups.

>>> import re
>>> l = ['''David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...''', '''Mark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...''', '''Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...''']
>>> for s in l:
...     print(re.findall(r'(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p', s))
...
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00')]
[('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')]
[('Albert Ellis', 'CEO', 'bought', '262', '52.00')]
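If the captured strings need to be used as numbers, the same delimiter-based idea can be written with named groups and a conversion step. The following is a sketch under stated assumptions, not part of the answer above: the names `DEAL` and `parse_deal` and the group names are hypothetical, the numeric groups are tightened to digits and commas, and the price is assumed to always be quoted in pence with a trailing `p`.

```python
import re
from decimal import Decimal

# re.VERBOSE ignores whitespace in the pattern, so literal spaces are
# written as [ ]; group names are illustrative only.
DEAL = re.compile(r'''
    (?P<name>.*?),[ ]                # everything before the first comma
    (?P<title>.*?)[ ]                # words up to the transaction type
    (?P<type>sold[ ]post-exercise|bought|exercised|sold)[ ]
    (?P<lot_size>[\d,]+)[ ]shares    # e.g. 15,000
    .*?[ ]price[ ]of[ ]
    (?P<price>\d+(?:\.\d+)?)p        # e.g. 1044.00 (pence)
''', re.VERBOSE)

def parse_deal(line):
    """Return a dict of the five fields, or None if the line does not match."""
    m = DEAL.search(line)
    if m is None:
        return None
    deal = m.groupdict()
    deal['lot_size'] = int(deal['lot_size'].replace(',', ''))  # '15,000' -> 15000
    deal['price'] = Decimal(deal['price'])                     # keeps exact pence
    return deal

line = ("David Meredith, Financial Director sold post-exercise 15,000 shares "
        "in the company on YYYY-mm-dd at a price of 1044.00p. The Director "
        "now holds 6,290 shares representing 0.01% of the...")
deal = parse_deal(line)
print(deal['name'], deal['type'], deal['lot_size'], deal['price'])
```

`Decimal` is used for the price so that values such as 76.75 are not subject to binary floating-point rounding.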

Python regex to parse financial data

4 answers: