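Answer 0 (score: 0)
First, you should convert Yr to a four-digit integer, i.e. 1961 or 2061. That is unambiguous, and, if you use the approach below, a YYYY-MM-DD layout (four-digit year) is required. That's because pandas uses format='%Y%m%d' in pandas/core/tools/datetimes.py: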
# From pandas/core/tools/datetimes.py, if you pass a DataFrame or dict
values = to_datetime(values, format='%Y%m%d', errors=errors)
So, for example:
from itertools import product
import numpy as np
import pandas as pd
np.random.seed(444)
datecols = ['Yr', 'Mo', 'Dy']
mapper = dict(zip(datecols, ('year', 'month', 'day')))
df = pd.DataFrame(list(product([61, 62], [1, 2], [1, 2, 3])),
                  columns=datecols)
df['data'] = np.random.randn(len(df))
Here is df:
In [11]: df
Out[11]:
Yr Mo Dy data
0 61 1 1 0.357440
1 61 1 2 0.377538
2 61 1 3 1.382338
3 61 2 1 1.175549
4 61 2 2 -0.939276
5 61 2 3 -1.143150
6 62 1 1 -0.542440
7 62 1 2 -0.548708
8 62 1 3 0.208520
9 62 2 1 0.212690
10 62 2 2 1.268021
11 62 2 3 -0.807303
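For simplicity, let's assume the true range is 1920 onward. (As written, the code below maps 0 through 20 to 2000-2020 and 21 through 99 to 1921-1999.)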
In [16]: yr = df['Yr']
In [17]: df['Yr'] = np.where(yr <= 20, 2000 + yr, 1900 + yr)
In [18]: df
Out[18]:
Yr Mo Dy data
0 1961 1 1 0.357440
1 1961 1 2 0.377538
2 1961 1 3 1.382338
3 1961 2 1 1.175549
4 1961 2 2 -0.939276
5 1961 2 3 -1.143150
6 1962 1 1 -0.542440
7 1962 1 2 -0.548708
8 1962 1 3 0.208520
9 1962 2 1 0.212690
10 1962 2 2 1.268021
11 1962 2 3 -0.807303
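The century step above can be wrapped in a small helper so the cutoff is stated in one place. This is only a sketch: `expand_year` and its `pivot` parameter are hypothetical names, not part of pandas, and the default cutoff of 20 simply mirrors the `np.where` call above.

```python
import numpy as np
import pandas as pd

def expand_year(yr, pivot=20):
    """Map two-digit years to four digits: 0..pivot -> 20xx, the rest -> 19xx.

    `expand_year` and `pivot` are hypothetical names used for this sketch.
    """
    yr = pd.Series(yr)
    return pd.Series(np.where(yr <= pivot, 2000 + yr, 1900 + yr), index=yr.index)

years = expand_year([61, 62, 5, 20, 21])
print(years.tolist())  # [1961, 1962, 2005, 2020, 1921]
```

Choose the pivot from what you know about your data; any fixed cutoff is a guess about which century the two-digit values belong to.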
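The second thing you need to do is rename the columns; pandas is quite strict about the column names if you pass a mapping or DataFrame to pd.to_datetime(). Here is that step and the result: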
In [21]: df.index = pd.to_datetime(df[datecols].rename(columns=mapper))
In [22]: df
Out[22]:
Yr Mo Dy data
1961-01-01 1961 1 1 0.357440
1961-01-02 1961 1 2 0.377538
1961-01-03 1961 1 3 1.382338
1961-02-01 1961 2 1 1.175549
1961-02-02 1961 2 2 -0.939276
1961-02-03 1961 2 3 -1.143150
1962-01-01 1962 1 1 -0.542440
1962-01-02 1962 1 2 -0.548708
1962-01-03 1962 1 3 0.208520
1962-02-01 1962 2 1 0.212690
1962-02-02 1962 2 2 1.268021
1962-02-03 1962 2 3 -0.807303
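To see why the rename matters, here is a minimal sketch on a small made-up frame (`raw` is an assumption for illustration, not the data above). Without columns named year, month and day, pd.to_datetime() raises a ValueError; with the rename it assembles the dates, and because rename() returns a copy the original column names are left alone.

```python
import pandas as pd

raw = pd.DataFrame({'Yr': [1961, 1962], 'Mo': [1, 2], 'Dy': [3, 4]})

# Without the rename, pandas cannot find year/month/day columns.
try:
    pd.to_datetime(raw)
    rename_needed = False
except ValueError:
    rename_needed = True

# rename() returns a copy, so the original column names stay in `raw`.
mapper = {'Yr': 'year', 'Mo': 'month', 'Dy': 'day'}
dates = pd.to_datetime(raw.rename(columns=mapper))

print(rename_needed)                             # True
print(list(raw.columns))                         # ['Yr', 'Mo', 'Dy']
print(dates.dt.strftime('%Y-%m-%d').tolist())    # ['1961-01-03', '1962-02-04']
```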
Finally, here is an alternative that concatenates the columns as strings (run against the original two-digit Yr values):
In [27]: as_str = df[datecols].astype(str)
In [30]: pd.to_datetime(
...: as_str['Yr'] + '-' + as_str['Mo'] +'-' + as_str['Dy'],
...: format='%y-%m-%d'
...: )
Out[30]:
0 2061-01-01
1 2061-01-02
2 2061-01-03
3 2061-02-01
4 2061-02-02
5 2061-02-03
6 2062-01-01
7 2062-01-02
8 2062-01-03
9 2062-02-01
10 2062-02-02
11 2062-02-03
dtype: datetime64[ns]
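Note again that this picks the century for you: with %y, 61 is parsed as 2061. If you want to be explicit, add the correct century using the same approach as above before defining as_str, and use %Y instead of %y in the format.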
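A sketch of that explicit variant, under the same 0-20 cutoff assumed earlier. The small frame here is made up for illustration, and zero-padding the month and day is a defensive choice on my part rather than something the original approach does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Yr': [61, 62], 'Mo': [1, 2], 'Dy': [3, 4]})

# Add the century first (assumed cutoff: 0..20 -> 20xx, otherwise 19xx).
full_year = pd.Series(
    np.where(df['Yr'] <= 20, 2000 + df['Yr'], 1900 + df['Yr']),
    index=df.index,
)

# Zero-pad month and day so every string has the same YYYY-MM-DD shape.
month = df['Mo'].astype(str).str.zfill(2)
day = df['Dy'].astype(str).str.zfill(2)

parsed = pd.to_datetime(
    full_year.astype(str) + '-' + month + '-' + day,
    format='%Y-%m-%d',  # %Y expects a four-digit year, so nothing is guessed
)
print(parsed.dt.year.tolist())  # [1961, 1962]
```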
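Answer 1 (score: 0)
As Brad pointed out, this is how I solved the problem:
import pandas as pd

def adjustyear(x):
    # Two-digit years above 20 are taken as 19xx, the rest as 20xx.
    # This cutoff is an assumption borrowed from the answer above; the
    # original test (x >= 1800) could never be true for two-digit input.
    if x > 20:
        return 1900 + x
    return 2000 + x

def parsefunc(x):
    # apply(axis=1) may upcast the row to float, so cast back to int.
    year = adjustyear(int(x['Yr']))
    # Zero-pad month and day so that '%Y%m%d' parses unambiguously.
    datetimestr = '%04d%02d%02d' % (year, int(x['Mo']), int(x['Dy']))
    # Invalid dates raise here instead of being passed through silently.
    return pd.to_datetime(datetimestr, format='%Y%m%d')

data['newindex'] = data.apply(parsefunc, axis=1)
data.index = data['newindex']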