I have a list of strings:
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
I want to extract the date and the person's name from each string.
I can do:
import re
pattern = r'.*(\d{4}-\d{2}-\d{2}).*with \b([^\b]+)\b.*'
matched = [re.match(pattern, x).groups() for x in my_strings]
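But this fails because the pattern does not match "with Tom on 2015-06-30".
How can I specify a regex pattern that works regardless of the order in which the date and the person appear in the string?
And how can I make sure the groups() method returns them in the same order every time?
I would like the output to look like this: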
[('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
Answer 0 (score: 4)
How about using 2 separate regular expressions?
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
import re
pattern = r'.*(\d{4}-\d{2}-\d{2})'
dates = [re.match(pattern, x).groups()[0] for x in my_strings]
pattern = r'.*with (\w+).*'
persons = [re.match(pattern, x).groups()[0] for x in my_strings]
output = list(zip(dates, persons))  # list() so the pairs print under Python 3
print(output)
## [('2002-03-04', 'Matt'), ('2016-01-23', 'Mary'), ('2015-06-30', 'Tom')]
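The list comprehensions above raise AttributeError as soon as one string lacks a date or a name, because re.match returns None in that case. Below is a minimal sketch of the same two-pattern idea, assuming re.search is acceptable and that a missing field should come back as None. The helper name extract_pair and the constants DATE_RE and PERSON_RE are made up for illustration.

```python
import re

# Hypothetical names, introduced only for this sketch.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
PERSON_RE = re.compile(r"\bwith (\w+)")


def extract_pair(text):
    """Return (date, person); a field is None when it is not found."""
    date = DATE_RE.search(text)      # search() scans the whole string
    person = PERSON_RE.search(text)
    return (
        date.group(0) if date else None,
        person.group(1) if person else None,
    )


my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

pairs = [extract_pair(s) for s in my_strings]
print(pairs)
```

Because each field is searched for independently, the order in which the date and the name appear does not matter, and the tuple order is fixed by the return statement.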
Answer 1 (score: 2)
This should work:
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30",
]
import re
alternates = r"(?:\b(\d{4}-\d\d-\d\d)\b|with (\w+)|.)*"
for tc in my_strings:
    print(tc)
    m = re.match(alternates, tc)
    if m:
        print("\t", m.group(1))
        print("\t", m.group(2))
The output is:
$ python test.py
2002-03-04 with Matt
         2002-03-04
         Matt
Important: 2016-01-23 with Mary
         2016-01-23
         Mary
with Tom on 2015-06-30
         2015-06-30
         Tom
However, something like this is not entirely intuitive. I encourage you to use named groups whenever possible.
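As one way to picture the named-groups suggestion, here is a sketch that swaps the repeated-alternation pattern for re.finditer over a two-branch alternation and reads the result by group name. The group names date and person, the constant TOKEN, and the helper extract are assumptions for illustration, not part of the answer above.

```python
import re

# Each alternative carries its own named group, so a single match
# tells us by name which kind of item was found.
TOKEN = re.compile(r"\b(?P<date>\d{4}-\d{2}-\d{2})\b|with (?P<person>\w+)")


def extract(text):
    """Collect the first date and the first person, in a fixed order."""
    found = {}
    for m in TOKEN.finditer(text):
        # m.lastgroup is the name of the group that matched: 'date' or 'person'.
        found.setdefault(m.lastgroup, m.group(m.lastgroup))
    return found.get("date"), found.get("person")


my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

results = [extract(s) for s in my_strings]
print(results)
```

Looking values up by name rather than by position keeps the output order stable even if the alternatives in the pattern are later rearranged.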
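Answer 2 (score: 2)
Purely for educational purposes, a non-regex approach could use the dateutil parser in "fuzzy" mode to extract the dates and the nltk toolkit's named entity recognition to extract the names. Complete code: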
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from dateutil.parser import parse
def extract_names(text):
    tokenizer = SpaceTokenizer()
    toks = tokenizer.tokenize(text)
    pos = pos_tag(toks)
    chunked_nes = ne_chunk(pos)
    return [' '.join(map(lambda x: x[0], ne.leaves())) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]
my_strings = [
"2002-03-04 with Matt",
"Important: 2016-01-23 with Mary",
"with Tom on 2015-06-30"
]
for s in my_strings:
    print(parse(s, fuzzy=True))
    print(extract_names(s))
Prints:
2002-03-04 00:00:00
['Matt']
2016-01-23 00:00:00
['Mary']
2015-06-30 00:00:00
['Tom']
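But this is probably overly complicated.
Answer 3 (score: 2)
If you use Python's new regex module, you can use conditionals to guarantee that both items are matched.
I think this is closer to the standard way of doing out-of-order matching.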
(?:.*?(?:(?(1)(?!))\b(\d{4}-\d\d-\d\d)\b|(?(2)(?!))with[ ](\w+))){2}
Expanded
(?:
     .*?
     (?:
          (?(1)(?!))
          \b
          ( \d{4} - \d\d - \d\d )       # (1)
          \b
       |  (?(2)(?!))
          with [ ]
          ( \w+ )                       # (2)
     )
){2}
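For comparison, order-independent matching can also be sketched with the standard-library re module by putting each capture inside its own lookahead at the start of the string. This is a different technique from the conditional pattern above, not a translation of it. The constant name UNORDERED is an assumption for illustration.

```python
import re

# Both lookaheads start from position 0 and scan forward independently,
# so neither depends on where the other item sits in the string.
UNORDERED = re.compile(
    r"""
    (?= .*? \b (\d{4}-\d\d-\d\d) \b )   # (1) date, anywhere in the string
    (?= .*? \b with [ ] (\w+) )         # (2) name following "with"
    """,
    re.VERBOSE,
)

my_strings = [
    "2002-03-04 with Matt",
    "Important: 2016-01-23 with Mary",
    "with Tom on 2015-06-30",
]

# match() succeeds only if both lookaheads succeed; groups() is always (date, name).
results = [UNORDERED.match(s).groups() for s in my_strings]
print(results)
```

If either item is missing, match() returns None, so real input would need a check before calling groups().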