numpy: convert strings to datetime when only the year information is available

Time: 2017-09-20 05:51:01

Tags: python pandas numpy datetime

Consider the following input:

[['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]   

The output I want is:

[[2016-12-31, 2015-12-31, 2014-12-31],
 [2016-03-31, 2015-03-31, 2014-03-31]]

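Basically, I want to convert elements 1-3 within each nested list to datetime objects, where the month (and day) information is taken from element 0.

I can think of a manually intensive solution, but I am looking for the most efficient way (in terms of speed) to achieve this. The actual data has thousands of rows.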

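2 Answers:

Answer 0: (score: 1)

You can use extract to pull the month and day out of the first column, radd to prepend them to each year, and then convert the result with to_datetime: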

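import numpy as np
import pandas as pd

L = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
     ['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]

a = np.array(L)
# Raw string, so the regex escapes \s and \d are passed through unchanged
pat = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2})'
# Extract month and day from column 0 and join them as e.g. 'Dec-31-'
d = pd.Series(a[:, 0]).str.extract(pat, expand=True).apply('-'.join, axis=1).add('-')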
print (d)
0    Dec-31-
1    Mar-31-
dtype: object

L1 = pd.DataFrame(a[:, 1:]).radd(d, 0).apply(pd.to_datetime).values.astype('datetime64[D]')
print (L1)
[['2016-12-31' '2015-12-31' '2014-12-31']
 ['2016-03-31' '2015-03-31' '2014-03-31']]

If performance is important, use a dictionary to map the months:

d = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 
     'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}

L2 = []
for l in L:
    a = l[0].split()[-3:-1]
    a = '-'.join([d[a[0]], a[1]])
    L2.append([x + '-' + a for x in l[1:]])

print (L2)

[['2016-12-31', '2015-12-31', '2014-12-31'],
 ['2016-03-31', '2015-03-31', '2014-03-31']]

Finally, if you need a numpy array:

print (np.array(L2))
[['2016-12-31' '2015-12-31' '2014-12-31']
 ['2016-03-31' '2015-03-31' '2014-03-31']]
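
If real datetime values are wanted rather than strings, the ISO-formatted strings built by the dictionary approach can be passed to NumPy with a datetime64 dtype. This is a minimal sketch, assuming the same two-row input; the names month_to_num, rows and dates are illustrative and not part of the original answer, and the zfill call is an added precaution for single-digit days.

```python
import numpy as np

# Month abbreviation -> two-digit month number
month_to_num = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
                'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
                'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

L = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
     ['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]

rows = []
for row in L:
    # The last three words of element 0 are month, day, year
    month, day = row[0].split()[-3:-1]
    suffix = '-' + month_to_num[month] + '-' + day.zfill(2)
    rows.append([year + suffix for year in row[1:]])

# NumPy parses 'YYYY-MM-DD' strings when asked for a datetime64 dtype
dates = np.array(rows, dtype='datetime64[D]')
print(dates.dtype, dates.shape)
```

Note that a day which does not exist in one of the target years (for example Feb 29 combined with 2015) would make the conversion raise a ValueError, so such rows would need separate handling.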

Timings:

L = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']] * 10000  


In [262]: %%timeit
     ...: d = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 
     ...:      'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}
     ...: 
     ...: L2 = []
     ...: for l in L:
     ...:     a = l[0].split()[-3:-1]
     ...:     a = '-'.join([d.get(a[0]), a[1]])
     ...:     L2.append([x + '-' + a for x in l[1:]])
     ...: 
10 loops, best of 3: 44.3 ms per loop

In [263]: %%timeit
     ...: out_list=[]
     ...: for l in L:
     ...:     l_date = datetime.strptime((" ").join(l[0].split()[-3:]), '%b %d %Y')
     ...:     out_list.append([("-").join([str(l_year),str(l_date.month),str(l_date.day)])
     ...:             for l_year in l[-3:]])
     ...: 
1 loop, best of 3: 303 ms per loop

In [264]: %%timeit
     ...: a = np.array(L)
     ...: pat = '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2})'
     ...: d = pd.Series(a[:, 0]).str.extract(pat, expand=True).apply('-'.join, 1).add('-')
     ...: L1 = pd.DataFrame(a[:, 1:]).radd(d, 0).apply(pd.to_datetime).values.astype('datetime64[D]')
     ...: 
1 loop, best of 3: 7.46 s per loop
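
The two pure-Python timings can also be reproduced outside IPython with the standard timeit module. The following is only a sketch: the function names with_dict and with_strptime are made up for illustration, and the numbers it prints will depend on the machine and Python version, so they are not expected to match the figures above.

```python
import timeit
from datetime import datetime

MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

L = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
     ['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']] * 10000


def with_dict(rows):
    """Dictionary lookup plus string concatenation."""
    out = []
    for row in rows:
        month, day = row[0].split()[-3:-1]
        suffix = '-' + MONTHS[month] + '-' + day
        out.append([year + suffix for year in row[1:]])
    return out


def with_strptime(rows):
    """Parse the date with datetime.strptime, then rebuild strings."""
    out = []
    for row in rows:
        d = datetime.strptime(' '.join(row[0].split()[-3:]), '%b %d %Y')
        out.append(['-'.join([year, str(d.month), str(d.day)])
                    for year in row[1:]])
    return out


for fn in (with_dict, with_strptime):
    # Best of 3 single runs, similar in spirit to %%timeit's "best of 3"
    best = min(timeit.repeat(lambda: fn(L), number=1, repeat=3))
    print(f'{fn.__name__}: {best * 1000:.1f} ms')
```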

Answer 1: (score: 0)

This creates your desired output as a nested list:

from datetime import datetime

in_list = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]

out_list=[]
for l in in_list:
    l_date = datetime.strptime((" ").join(l[0].split()[-3:]), '%b %d %Y')
    out_list.append([("-").join([str(l_year),str(l_date.month),str(l_date.day)])
            for l_year in l[-3:]])
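
The loop above builds strings such as '2016-3-31' rather than date objects. If date objects are preferred, one possible variant is to parse element 0 once and swap in each year with replace. This is a sketch under that assumption; out_dates and base are illustrative names, not part of the original answer.

```python
from datetime import datetime

in_list = [['Fiscal data as of Dec 31 2016', '2016', '2015', '2014'],
           ['Fiscal data as of Mar 31 2016', '2016', '2015', '2014']]

out_dates = []
for row in in_list:
    # Parse the trailing 'Mon DD YYYY' of element 0 into a date
    base = datetime.strptime(' '.join(row[0].split()[-3:]), '%b %d %Y').date()
    # Keep month and day, substitute each year from elements 1-3
    out_dates.append([base.replace(year=int(year)) for year in row[1:]])

print(out_dates)
```

A base date of Feb 29 would make replace raise a ValueError for non-leap years, so real data may need a fallback for that case.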