I have the following text blob:
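“They’re doing a ballet,” says Adam Ellison, a materials scientist at the company, watching the furnace workers as the glass dumps brimstone-like heat into the surrounding air. “It’s hot as hell, the glass 23 october 2003 gets stiff very quickly, and you can only work with it for a few minutes,” he says. Ellison would know—he helped develop the material they’re pouring, which is 19 November 2003 branded Gorilla Glass and is October 17, 2011 found on many smartphones because it is tough, thin, and 19 November 200000003 lightweight 41 january 1098.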
I want to create a regular expression that extracts all possible date formats. For example, the regex must extract:
23 october 2003
19 November 2003
October 17, 2011
For the above, I tried the following:
((\d+).(January|February|March|April|May|June|July|August|September|October|November|December).(\d+))
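But I don't know how to match whitespace or ignore case (?:), and especially how to handle this format: October 17, 2011. Any idea how to get the desired output listed above?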
Answer 0 (score: 1)
Do you explicitly need the month names?
(?:[1-3][0-9]\s\w+|\w+\s[1-3][0-9]),?\s[0-9]+
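So:
[1-3][0-9] is the day (not exactly 01-31; it actually matches 10-39),
\s\w+ is whitespace plus a word,
,? is an optional comma, followed by whitespace.
I think 01-31 would look like (0[1-9]|[12][0-9]|3[01]), but you mentioned "possible dates", and 31 February would not be "possible"...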
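If you do want the month names, here is a minimal sketch of how the pieces above could be combined. It assumes English month names, a four-digit year starting with 1 or 2, and an optional leading zero on the day; the names MONTHS, DAY, YEAR and DATE_RE are made up for this example. re.IGNORECASE takes care of "october" vs "October".

```python
import re

# Hypothetical building blocks (illustrative names, not from the question)
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"   # 1-31, leading zero optional
YEAR = r"[12][0-9]{3}"                 # assumption: years 1000-2999

# Either "23 october 2003" or "October 17, 2011"; \b keeps digit runs whole
DATE_RE = re.compile(
    rf"\b(?:{DAY}\s+(?:{MONTHS})|(?:{MONTHS})\s+{DAY},?)\s+{YEAR}\b",
    re.IGNORECASE,
)

text = ("the glass 23 october 2003 gets stiff, which is 19 November 2003 "
        "branded and is October 17, 2011 found, and 19 November 200000003 "
        "lightweight 41 january 1098. 31 february 1990")

dates = DATE_RE.findall(text)
print(dates)
# ['23 october 2003', '19 November 2003', 'October 17, 2011', '31 february 1990']
```

Note that "31 february 1990" still gets through: the pattern only checks the shape of the day, not whether the date exists, so a separate validation step is needed for that.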
Answer 1 (score: 1)
Given that text, you can get the dates like this:
>>> re.findall(r'\b(?:[1-3][0-9]\s[a-zA-Z]+\s[12][0-9]{3}|[a-zA-Z]+\s[1-3][0-9],\s?[12][0-9]{3})\b', txt)
['23 october 2003', '19 November 2003', 'October 17, 2011']
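One thing to watch with this kind of pattern is where the \b anchors sit. Inside an alternation, a \b belongs only to the branch it is written in, so a trailing \b on the second branch does nothing for the first. A small sketch of the difference, using a shortened sample string (the variable names are made up for the illustration):

```python
import re

sample = "thin, and 19 November 200000003 lightweight"

# \b at the start guards only the first branch, \b at the end only the second
loose = r'(\b(?:[1-3][0-9]\s[a-zA-Z]+\s[12][0-9]{3})|(?:[a-zA-Z]+\s[1-3][0-9],\s?[12][0-9]{3})\b)'

# Both \b anchors sit outside the group, so they apply to either branch
strict = r'\b(?:[1-3][0-9]\s[a-zA-Z]+\s[12][0-9]{3}|[a-zA-Z]+\s[1-3][0-9],\s?[12][0-9]{3})\b'

loose_hits = re.findall(loose, sample)
strict_hits = re.findall(strict, sample)

print(loose_hits)   # ['19 November 2000']  (the long number is cut to four digits)
print(strict_hits)  # []
```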
Answer 2 (score: 1)
You could try something like this:
from dateutil import parser
import re
a = """“They’re doing a ballet,” says Adam Ellison, a materials scientist at the company, watching the furnace workers as the glass dumps brimstone-like heat into the surrounding air. “It’s hot as hell, the glass 23 october 2003 gets stiff very quickly, and you can only work with it for a few minutes,” he says. Ellison would know—he helped develop the material they’re pouring, which is 19 November 2003 branded Gorilla Glass and is October 17, 2011 found on many smartphones because it is tough, thin, and 19 November 200000003 lightweight 41 january 1098."""
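# Grab every "<token> <token> <4-digit year>" candidate; \b rejects longer digit runs
b = re.findall(r'\S+ \S+ \d{4}\b', a)
print(b)
tl = []
for c in b:
    try:
        # parser.parse raises ValueError for impossible dates such as "41 january 1098"
        parser.parse(c)
        tl.append(c)
    except (ValueError, OverflowError):
        pass
print(tl)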
Output:
['23 october 2003', '19 November 2003', 'October 17, 2011', '41 january 1098']
['23 october 2003', '19 November 2003', 'October 17, 2011']
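If you would rather not depend on dateutil, the same filtering step can probably be done with the standard library. This is only a sketch: it assumes the two layouts from the question ("%d %B %Y" and "%B %d, %Y") and an English locale for the month names, and is_real_date is a made-up helper name.

```python
from datetime import datetime

# Assumption: only these two layouts occur; add more formats as needed
FORMATS = ("%d %B %Y", "%B %d, %Y")

def is_real_date(candidate):
    """Return True if the string parses as an existing calendar date."""
    for fmt in FORMATS:
        try:
            datetime.strptime(candidate, fmt)
            return True
        except ValueError:
            continue  # wrong layout or impossible date, try the next format
    return False

candidates = ['23 october 2003', '19 November 2003', 'October 17, 2011',
              '41 january 1098', '31 february 1990']
valid = [c for c in candidates if is_real_date(c)]
print(valid)
# ['23 october 2003', '19 November 2003', 'October 17, 2011']
```

strptime matches month names case-insensitively, so "october" is accepted, and it rejects both "41 january" and "31 february" because it builds a real date object.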
Although this is not the best solution, it works:
from IPython.display import display as dp
import pandas as pd
import re
a="""“They’re doing a ballet,” says Adam Ellison, a materials scientist at the company, watching the furnace workers as the glass dumps brimstone-like heat into the surrounding air. “It’s hot as hell, sdkhfks BDR 1990 the glass 23 october 2003 gets stiff very quickly, and you can only work with it for a few minutes,” he says. Ellison would know—he helped develop the material they’re pouring, which is 19 November 2003 branded Gorilla Glass and is October 17, 2011 found on many smartphones because it is tough, thin, and 19 November 200000003 lightweight 41 january 1098. 31 february 1990 sdkhfks AB 1990. """
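def foo(a):
    # Candidate "<token> <token> <4-digit year>" triples; \b rejects longer digit runs
    b = re.findall(r'\S+ \S+ \d{4}\b', a)
    tl = []
    for c in b:
        try:
            # pd.tseries.tools.parse_time_string is not available in current pandas;
            # pd.to_datetime is used here instead and raises ValueError (or a subclass)
            # for strings it cannot turn into a date
            pd.to_datetime(c)
            tl.append(c)
        except ValueError:
            pass
    return tl

df = pd.DataFrame(data={'c1': [a, a]})
dp(df)
# One list of validated date strings per row
df['valid_dates'] = df.c1.apply(lambda x: foo(str(x)))
dp(df)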
Output: