Parsing human-formatted date ranges

Time: 2014-11-10 12:05:44

Tags: python pandas

I have date ranges in a human-readable format:

dt = pd.Series(['27.02-11.03.2014', '10-11.06.2014'])

I want a DataFrame containing the start and end date of each event. Currently I use:

tmp = dt.str.split('-', expand=True)
tmp.columns = ['start', 'end']
# Only the end part is a complete date; dt_parse below completes the start part.
tmp['end'] = pd.to_datetime(tmp['end'], format='%d.%m.%Y')

def dt_parse(dt):
    x, y = dt
    if len(x) > 2:
        t = x.split('.')
        r = pd.to_datetime('-'.join([t[0], t[1], str(y.year)]), dayfirst = True)
    else:
        r = pd.to_datetime('-'.join([x, str(y.month), str(y.year)]), dayfirst = True)
    return r

tmp['start'] = tmp.apply(dt_parse, axis = 1)

and get

       start        end
0 2014-02-27 2014-03-11
1 2014-06-10 2014-06-11

Any other (more efficient/elegant) ideas for how to do this?

BR

1 Answer:

Answer 0: (score: 0)

You can use dt.str.extract with a regular expression to pull out the individual values:

In [108]: df = dt.str.extract(r'(?P<start_day>\d+)(?:\.(?P<start_month>\d+))?-(?P<end_day>\d+)\.(?P<end_month>\d+)\.(?P<year>\d+)')

In [109]: df
Out[109]: 
  start_day start_month end_day end_month  year
0        27          02      11        03  2014
1        10         NaN      11        06  2014
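
To see why start_month comes back empty for the second row, the same pattern can be tried on single strings with the standard re module. This is only a sketch: split_range is a hypothetical helper name, not part of pandas or of the answer above.

```python
import re

# Same pattern as above. The start month sits inside an optional
# non-capturing group, so it may be absent from the match.
PATTERN = re.compile(
    r'(?P<start_day>\d+)(?:\.(?P<start_month>\d+))?-'
    r'(?P<end_day>\d+)\.(?P<end_month>\d+)\.(?P<year>\d+)'
)

def split_range(text):
    """Return the named groups of one range string as a dict (hypothetical helper)."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognised date range: {text!r}")
    return match.groupdict()

full = split_range('27.02-11.03.2014')
short = split_range('10-11.06.2014')
print(full)   # every group is filled
print(short)  # start_month is None because the optional group did not match
```

With plain re the unmatched group is None; str.extract reports the same gap as NaN.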

The missing start_month values can be filled in with the fillna method:

df['start_month'] = df['start_month'].fillna(value=df['end_month'])
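
As a small illustration of that step, here is a sketch on a toy frame (the frame named parts and its values are made up for the example). Passing a Series to fillna aligns on the index, so each gap is filled from the same row of end_month.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the extract output: row 1 has no start month.
parts = pd.DataFrame({
    'start_month': ['02', np.nan],
    'end_month': ['03', '06'],
})

# Row-wise fill: the NaN in row 1 takes the end_month value of row 1.
parts['start_month'] = parts['start_month'].fillna(parts['end_month'])
print(parts)
```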

Then use the combine64 function (below) to combine the individual numbers into np.datetime64 values:

import numpy as np
import pandas as pd

def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    # Years become offsets from the 1970 epoch; months and days become zero-based offsets.
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
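    # Cast each supplied component to its unit (datetime64[Y] for years,
    # timedelta64 for the rest) and add them up.
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)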

dt = pd.Series(['27.02-11.03.2014', '10-11.06.2014'])

df = dt.str.extract(r'(?P<start_day>\d+)(?:\.(?P<start_month>\d+))?-(?P<end_day>\d+)\.(?P<end_month>\d+)\.(?P<year>\d+)')
df = df.astype('float')
df['start_month'] = df['start_month'].fillna(value=df['end_month'])
df['start'] = combine64(df['year'], df['start_month'], df['start_day'])
df['end'] = combine64(df['year'], df['end_month'], df['end_day'])
df = df[['start', 'end']]
print(df)

yields

       start        end
0 2014-02-27 2014-03-11
1 2014-06-10 2014-06-11
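
As a possible alternative to combine64, pd.to_datetime can assemble dates from a DataFrame whose columns are named year, month and day. The sketch below swaps that in for the last step; it is not the method used above, and assemble is a hypothetical helper name.

```python
import pandas as pd

dt = pd.Series(['27.02-11.03.2014', '10-11.06.2014'])

pattern = (r'(?P<start_day>\d+)(?:\.(?P<start_month>\d+))?-'
           r'(?P<end_day>\d+)\.(?P<end_month>\d+)\.(?P<year>\d+)')
parts = dt.str.extract(pattern)

# Fill the missing start month first, so every column can be cast to int.
parts['start_month'] = parts['start_month'].fillna(parts['end_month'])
parts = parts.astype(int)

def assemble(frame, prefix):
    """Build dates from year plus <prefix>_month and <prefix>_day (hypothetical helper)."""
    renamed = frame.rename(columns={f'{prefix}_month': 'month',
                                    f'{prefix}_day': 'day'})
    # to_datetime accepts a frame with year/month/day columns.
    return pd.to_datetime(renamed[['year', 'month', 'day']])

result = pd.DataFrame({'start': assemble(parts, 'start'),
                       'end': assemble(parts, 'end')})
print(result)
```

Like the approach above, this assumes both dates fall in the same year; a range such as 27.12-05.01.2015 would get the wrong start year.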