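This question is similar to this one. I originally answered that question with this solution, but it turned out I had misunderstood it. Still, I think my answer is useful for a slightly different use case, so I am posting it here.
Given a text file: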
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
containing extracted dates in various formats, the task is to read them into a DataFrame, sort them, and display the output in MM/DD/YYYY format.
Expected output:
0 06/01/2008
1 01/01/2009
2 02/01/2009
3 03/20/2009
4 03/20/2009
5 03/20/2009
6 03/20/2009
7 03/20/2009
8 03/20/2009
9 03/20/2009
10 03/20/2009
11 03/20/2009
12 03/20/2009
13 03/21/2009
14 03/22/2009
15 04/03/2009
16 04/20/2009
17 04/20/2009
18 04/20/2009
19 09/01/2009
20 12/01/2009
21 01/01/2010
22 10/01/2010
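How can this be done in pandas?
Note: if the day is missing, assume the 1st; if the month is missing, assume January.
Answer 0 (score: 2)
Reproducible setup (to make an MCVE easy):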
import pandas as pd
import io
text = '''04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010'''
buf = io.StringIO(text)
# the regex separator needs a raw string and the python parser engine
df = pd.read_csv(buf, engine='python', delimiter=r';\s+', header=None).reset_index()
df
index 0 1 2 \
0 04/20/2009 04/20/09 4/20/09 4/3/09
1 Mar-20-2009 Mar 20, 2009 March 20, 2009 Mar. 20, 2009
2 20 Mar 2009 20 March 2009 20 Mar. 2009 20 March, 2009
3 Mar 20th, 2009 Mar 21st, 2009 Mar 22nd, 2009 None
4 Feb 2009 Sep 2009 Oct 2010 None
5 6/2008 12/2009 None None
6 2009 2010 None None
3
0 None
1 Mar 20 2009;
2 None
3 None
4 None
5 None
6 None
Replace buf with the name of your text file.
You can use df.stack and apply, followed by pd.Series.sort_values.
out = df.stack().apply(pd.to_datetime)\
.reset_index(drop=1)\
.sort_values().dt.strftime('%m/%d/%Y')\
.reset_index(drop=1)
print(out)
0 06/01/2008
1 01/01/2009
2 02/01/2009
3 03/20/2009
4 03/20/2009
5 03/20/2009
6 03/20/2009
7 03/20/2009
8 03/20/2009
9 03/20/2009
10 03/20/2009
11 03/20/2009
12 03/20/2009
13 03/21/2009
14 03/22/2009
15 04/03/2009
16 04/20/2009
17 04/20/2009
18 04/20/2009
19 09/01/2009
20 12/01/2009
21 01/01/2010
22 10/01/2010
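To see what the stack step contributes on its own, here is a minimal sketch on a made-up two-row frame. The name toy and its values are illustrative and not part of the question's data, and the explicit dropna() is my own addition rather than part of the answer above.

```python
import pandas as pd

# Toy frame (made-up values): the second row is shorter, so it is padded with None
toy = pd.DataFrame([["4/3/09", "2009"], ["Feb 2009", None]])

# stack() moves every cell into one long Series, row by row.
# dropna() is an extra, explicit step added here so the padding cells
# are removed even if stack() keeps them.
flat = toy.stack().dropna().reset_index(drop=True)

print(flat.tolist())
```

The result is a single column of raw date strings, which is the shape pd.to_datetime and sort_values work on.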
Answer 1 (score: 2)
It is simpler to omit apply and to call reset_index only once. Also, in my opinion drop=1 is less readable than drop=True:
out = pd.to_datetime(df.stack()).sort_values().dt.strftime('%m/%d/%Y').reset_index(drop=True)
print(out)
0 06/01/2008
1 01/01/2009
2 02/01/2009
3 03/20/2009
4 03/20/2009
5 03/20/2009
6 03/20/2009
7 03/20/2009
8 03/20/2009
9 03/20/2009
10 03/20/2009
11 03/20/2009
12 03/20/2009
13 03/21/2009
14 03/22/2009
15 04/03/2009
16 04/20/2009
17 04/20/2009
18 04/20/2009
19 09/01/2009
20 12/01/2009
21 01/01/2010
22 10/01/2010
dtype: object
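For comparison, the defaulting rule from the question (missing day becomes the 1st, missing month becomes January) can be made explicit without relying on the parser's inference. The following is only a sketch using the standard library, not the pandas approach from the answers. The FORMATS list and the parse_date helper are hypothetical names, the list only covers the shapes that appear in the sample file, and it assumes an English locale for month names.

```python
import re
from datetime import datetime

TEXT = """04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010"""

# Hypothetical format list: only the shapes seen in the sample above.
# strptime fills a missing day or month with 1, which is the rule from the question.
FORMATS = [
    "%m/%d/%Y", "%m/%d/%y",
    "%b-%d-%Y", "%b %d %Y", "%B %d %Y",
    "%d %b %Y", "%d %B %Y",
    "%b %Y", "%B %Y",
    "%m/%Y",
    "%Y",
]

def parse_date(raw):
    """Try each known format in turn; raise if none matches."""
    s = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw.strip())  # drop ordinal suffixes
    s = s.replace(".", "").replace(",", "")                  # drop punctuation
    s = " ".join(s.split())                                  # collapse whitespace
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")

# Split on newlines and semicolons, skipping the empty piece left by a trailing ';'
tokens = [t.strip() for t in TEXT.replace("\n", ";").split(";") if t.strip()]
formatted = [d.strftime("%m/%d/%Y") for d in sorted(parse_date(t) for t in tokens)]

for i, value in enumerate(formatted):
    print(i, value)
```

The trade-off is that every format has to be listed by hand, whereas pd.to_datetime infers it. In exchange, an unexpected string fails loudly instead of being guessed.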