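Manipulating date fields in pandas

Time: 2016-02-02 06:22:51

Tags: python pandas

What is the fastest way to manipulate a date field in a pandas DataFrame, for example replacing the day of each date with the last day of its month? Currently I can do the following, but it takes a long time to run.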

import calendar
consumption_data_monthly.DATE = consumption_data_monthly.DATE.apply(lambda x: x.replace(day=calendar.monthrange(x.year,x.month)[1]))

2 answers:

Answer 0 (score: 2)

I think calendar.monthrange is very effective and very fast, but a vectorized approach is faster.

You can try converting the DATE column to a numpy array of months with values and astype, then adding one month and subtracting one day:

# Parentheses let the expression continue across lines
df['DATE'] = (df['DATE'].values.astype('datetime64[M]')
              + np.array([1], dtype='timedelta64[M]')
              - np.array([1], dtype='timedelta64[D]'))
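To make the month arithmetic easier to follow, here is a minimal sketch that splits the expression above into its three steps. The sample dates are made up for illustration, and np.timedelta64 scalars are used in place of the one-element arrays; the idea is the same.

```python
import numpy as np

# Hypothetical sample dates, chosen to include a leap-year February.
dates = np.array(['2012-01-05', '2012-02-08', '2011-02-08'],
                 dtype='datetime64[ns]')

# Step 1: truncate to month precision (the day and time are dropped).
months = dates.astype('datetime64[M]')

# Step 2: move to the following month.
next_months = months + np.timedelta64(1, 'M')

# Step 3: subtract one day. numpy promotes the result to day precision,
# which lands on the last day of the original month.
month_ends = next_months - np.timedelta64(1, 'D')

print(month_ends)  # ['2012-01-31' '2012-02-29' '2011-02-28']
```

Because each step works on the whole array at once, no Python-level loop over the rows is involved, which is presumably where the speedup comes from.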

Timings (len(df)=70000):

In [468]: %timeit one(df)
1 loops, best of 3: 881 ms per loop

In [469]: %timeit two(df1)
1 loops, best of 3: 733 ms per loop

In [470]: %timeit three(df2)
1 loops, best of 3: 1.24 s per loop

In [471]: %timeit four(df3)
100 loops, best of 3: 6.61 ms per loop

In [472]: %timeit five(df4)
100 loops, best of 3: 8.76 ms per loop

Code:

import pandas as pd
import numpy as np
import calendar
import datetime
from pandas.tseries.offsets import MonthEnd

d = {'DATE': {0: pd.Timestamp('2012-01-05 00:00:00'), 1: pd.Timestamp('2012-02-08 00:00:00'), 2: pd.Timestamp('2012-03-11 00:00:00'), 3: pd.Timestamp('2012-04-06 00:00:00'), 4: pd.Timestamp('2012-05-04 00:00:00'), 5: pd.Timestamp('2012-06-20 00:00:00'), 6: pd.Timestamp('2012-07-09 00:00:00')}}
df = pd.DataFrame(d)
print(df)

df =  pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
df4 = df.copy()

def one(df):
    df.DATE = df.DATE.apply(lambda x: x.replace(day=calendar.monthrange(x.year,x.month)[1]))
    return df

def two(df):    
    df['DATE'] = df['DATE'].map(lambda x: datetime.datetime(x.year, x.month, calendar.monthrange(x.year,x.month)[1]))
    return df

def three(df):    
    df['DATE'] = df['DATE'].map(lambda x: datetime.datetime(x.year, x.month, x.days_in_month))
    return df

def four(df): 
    df['DATE'] = df['DATE'].values.astype('datetime64[M]') + np.array([1], dtype='timedelta64[M]') - np.array([1], dtype='timedelta64[D]')
    return df

def five(df):    
    df['DATE'] = df['DATE'] + MonthEnd()
    return df

print(one(df).head())
print(two(df1).head())
print(three(df2).head())
print(four(df3).head())
print(five(df4).head())
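As a sanity check that the approaches agree, something like the following sketch could be used. It mirrors one(), four() and five() on a small made-up Series and returns new values instead of assigning back into the frame; the helper name as_dates is hypothetical.

```python
import calendar

import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical sample; none of these dates is already a month end.
s = pd.Series(pd.to_datetime(['2012-01-05', '2012-02-08', '2012-04-06']))

# Per-row version, as in one().
by_apply = s.apply(
    lambda x: x.replace(day=calendar.monthrange(x.year, x.month)[1]))

# numpy version, as in four().
by_numpy = pd.Series(s.values.astype('datetime64[M]')
                     + np.timedelta64(1, 'M')
                     - np.timedelta64(1, 'D'))

# Offset version, as in five().
by_offset = s + MonthEnd()


def as_dates(col):
    # Compare calendar dates so datetime resolution differences do not matter.
    return col.dt.strftime('%Y-%m-%d').tolist()


assert as_dates(by_apply) == as_dates(by_numpy) == as_dates(by_offset)
print(as_dates(by_offset))  # ['2012-01-31', '2012-02-29', '2012-04-30']
```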

Answer 1 (score: 1)

Use a DateOffset to add the month end to your dates:

In [25]:
from pandas.tseries.offsets import *
df['DATE'] + MonthEnd()

Out[25]:
0   2012-01-31
1   2012-02-29
2   2012-03-31
3   2012-04-30
4   2012-05-31
5   2012-06-30
6   2012-07-31
Name: DATE, dtype: datetime64[ns]
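One thing worth checking before relying on this: with the default n=1, MonthEnd() moves a date that is already a month end on to the next month end, whereas MonthEnd(0) leaves it in place. The sample data does not contain such a date, so the output above is unaffected. A small sketch with a made-up Series:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical sample: the second date already falls on a month end.
s = pd.Series(pd.to_datetime(['2012-01-05', '2012-01-31']))

rolled = s + MonthEnd()     # n=1: always moves forward to a later month end
anchored = s + MonthEnd(0)  # n=0: dates already on a month end stay put

print(rolled.dt.strftime('%Y-%m-%d').tolist())    # ['2012-01-31', '2012-02-29']
print(anchored.dt.strftime('%Y-%m-%d').tolist())  # ['2012-01-31', '2012-01-31']
```

If the goal is "last day of the same month" for every row, MonthEnd(0) is probably the variant that matches the question.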

Timings

In [26]:
def four(df):
    df['DATE'] = df['DATE'].values.astype('datetime64[M]') + np.array([1], dtype='timedelta64[M]') - np.array([1], dtype='timedelta64[D]')
    return df

%timeit four(df)
%timeit df['DATE'] + MonthEnd()
1000 loops, best of 3: 206 µs per loop
The slowest run took 272.78 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 139 µs per loop

You can see that using the offset is faster than the suggested solution here.

On 70K rows, the timings are:

100 loops, best of 3: 5.69 ms per loop
100 loops, best of 3: 8 ms per loop

So for larger DataFrames the other solution is faster, while the syntax here is cleaner.
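To reproduce this kind of comparison outside IPython, a rough sketch with the standard timeit module might look like this. The frame size, the run count and the function names numpy_month_end and offset_month_end are assumptions made for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical frame, built like the one in the first answer but smaller.
base = pd.DataFrame(
    {'DATE': pd.to_datetime(['2012-01-05', '2012-02-08', '2012-03-11'])})
df = pd.concat([base] * 1000).reset_index(drop=True)


def numpy_month_end(frame):
    # Returns the result instead of assigning it back,
    # so every timed run sees the same input.
    return (frame['DATE'].values.astype('datetime64[M]')
            + np.timedelta64(1, 'M')
            - np.timedelta64(1, 'D'))


def offset_month_end(frame):
    return frame['DATE'] + MonthEnd()


# The run count is kept small here; raise it for steadier measurements.
t_numpy = timeit.timeit(lambda: numpy_month_end(df), number=20)
t_offset = timeit.timeit(lambda: offset_month_end(df), number=20)
print('numpy:  %.6f s for 20 runs' % t_numpy)
print('offset: %.6f s for 20 runs' % t_offset)
```

Returning the result rather than writing it back into df matters for the offset version in particular: assigning df['DATE'] + MonthEnd() back on every run would push the dates one month further each time.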