Question

I have a df column containing text, and I am trying to extract different date patterns from it.

For example, this df1:

<index>    text    
0          My birthday is 10/23/89.
1          Christmas is on December 25th.
2          Thanksgiving of 11/2008 was the best.

The desired output is a third column called dates:

<index>    text                                  dates
0          My birthday is 10/23/89.               10/23/89
1          Christmas is on December 25.           25 December
2          Thanksgiving of 11/2008 was the best.  11/2008

To pull out the first date pattern, I wrote my first expression like this:

df1['date'] = (df1['text'].str.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}'))

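That is where I got stuck.

I don't know how to apply several regular expressions one after another without overwriting what is already in the df1['date'] column.

The next expression I want to run is: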

df1['dates'] = df1['text'].str.findall(r'(?:\d{1,2})?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}')

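What is the way, or the best way, to check whether the df['dates'] column is empty for a row and only then try the next regular expression?

I asked this question earlier today and it was marked as a possible duplicate of another question, but I think DeepSpace assumed I am smarter than I really am; my question is much simpler than the one he answered.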

Answer 1

You can try

df['dates'] = df['text'].str.extract(r'.*?(\d+/\d+/?\d*).*?', expand=False)


    text                                    dates
0   My birthday is 10/23/89.                10/23/89
1   Christmas is 12/25.                     12/25
2   Thanksgiving of 11/2008 was the best.   11/2008
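
Note that row 1 in the output above reads "Christmas is 12/25.", not the original text from the question. A minimal sketch, assuming the three sample rows exactly as given in the question, shows that the numeric pattern leaves NaN where there is no slash-separated date, which is what makes a fallback pattern necessary:

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({'text': [
    'My birthday is 10/23/89.',
    'Christmas is on December 25th.',
    'Thanksgiving of 11/2008 was the best.',
]})

# With a single capture group and expand=False, str.extract returns a Series;
# rows where the pattern does not match are left as NaN.
numeric = df['text'].str.extract(r'(\d+/\d+/?\d*)', expand=False)
print(numeric.isna().tolist())
```

The row that comes back as missing is the one that still needs a second pattern.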

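To handle the added test case:

months = 'January|February|March|April|May|June|July|August|September|October|November|December'
# Group the alternation so that \d+ applies to every month, not only December
df['text'].str.extract(r'(\d+/\d+/?\d*)|((?:' + months + r') \d+)', expand=False)\
.fillna('').sum(axis=1)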

You get

0       10/23/89
1    December 25
2        11/2008
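
If you would rather keep the expressions separate, as the question asks, one possible approach is to loop over them and fill only the rows that are still empty. This is a rough sketch, not the answer's method: the `patterns` list and the `MONTHS` helper are assumptions made for illustration, and they cover only the three formats in the sample data.

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'My birthday is 10/23/89.',
    'Christmas is on December 25th.',
    'Thanksgiving of 11/2008 was the best.',
]})

# Hypothetical helper: full month names joined into one alternation
MONTHS = ('January|February|March|April|May|June|July|'
          'August|September|October|November|December')

# Ordered from most to least specific; each has exactly one capture group
patterns = [
    r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})',    # e.g. 10/23/89
    r'(\d{1,2}[/-]\d{4})',                 # e.g. 11/2008
    r'((?:' + MONTHS + r') \d{1,2})',      # e.g. December 25
]

# Start with an all-NaN column, then let each pattern fill what is still missing
dates = pd.Series(index=df.index, dtype=object)
for pat in patterns:
    found = df['text'].str.extract(pat, expand=False)
    dates = dates.fillna(found)  # rows that already have a value are untouched

df['dates'] = dates
print(df)
```

Because `fillna` only writes into rows that are still NaN, an earlier match is never overwritten by a later pattern, so the order of the list decides which pattern wins.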

Iterating over different regex patterns on a df column