Vectorized construction of a DatetimeIndex in Pandas

Time: 2016-04-22 22:16:45

Tags: python pandas

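I would like to create a DatetimeIndex in Pandas by broadcasting arrays of years, months, days, hours, etc. With a list comprehension this is relatively straightforward: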

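from datetime import datetime

import numpy as np
import pandas as pd

def build_DatetimeIndex(*args):
    # np.broadcast yields one (year, month, day, ...) tuple per output element
    return pd.DatetimeIndex([datetime(*tup)
                             for tup in np.broadcast(*args)])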

For example:

>>> year = 2012
>>> months = [1, 2, 5, 6]
>>> days = [1, 15, 1, 15]
>>> build_DatetimeIndex(year, months, days)
DatetimeIndex(['2012-01-01', '2012-02-15', '2012-05-01', '2012-06-15'], 
              dtype='datetime64[ns]', freq=None)

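Because of the list comprehension, however, this becomes quite slow as the inputs grow. Is there a built-in way to do this in Pandas, or is there any way to define build_DatetimeIndex in terms of fast vectorized operations?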

4 Answers:

Answer 0: (score: 3)

You can build timedelta arrays with the dtypes m8[Y], m8[M], and m8[D], and add them together to the date "0000-01-01":

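from datetime import datetime

import pandas as pd
import numpy as np

year = np.arange(2010, 2020)
months = np.arange(1, 13)
days = np.arange(1, 29)

# Cartesian product of years, months and days, flattened to 1-D arrays
y, m, d = map(np.ravel, np.broadcast_arrays(*np.ix_(year, months, days)))

start = np.array(["0000-01-01"], dtype="M8[Y]")

# Year, month and day offsets from the start date; the unit refines Y -> M -> D
r1 = start + y.astype("m8[Y]") + (m - 1).astype("m8[M]") + (d - 1).astype("m8[D]")

# List-comprehension version from the question, used here as a reference
def build_DatetimeIndex(*args):
    return pd.DatetimeIndex([datetime(*tup)
                             for tup in np.broadcast(*args)])

r2 = build_DatetimeIndex(y, m, d)

# Check that both approaches agree
np.all(pd.DatetimeIndex(r1) == r2)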
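
To match the call signature in the question, where the year may be a scalar, the same offset arithmetic can be wrapped in a function that broadcasts its inputs first. This is only a sketch: the name build_datetime_index_vectorized is made up for illustration, and instead of the "0000-01-01" start date it casts integers straight to datetime64[Y], which counts years from 1970.

```python
import numpy as np
import pandas as pd

def build_datetime_index_vectorized(year, month, day):
    # Hypothetical helper name, not part of the answer above.
    # Broadcast scalars and lists to a common shape, then flatten to 1-D.
    y, m, d = (np.ravel(a) for a in np.broadcast_arrays(year, month, day))
    # Integer -> datetime64[Y] casts count years from 1970, so subtract the epoch year.
    dates = (y - 1970).astype("M8[Y]")
    # Adding month and day offsets refines the unit to months, then to days.
    dates = dates + (m - 1).astype("m8[M]") + (d - 1).astype("m8[D]")
    return pd.DatetimeIndex(dates)

idx = build_datetime_index_vectorized(2012, [1, 2, 5, 6], [1, 15, 1, 15])
print(idx)
```

The scalar year is stretched across the four months and days by np.broadcast_arrays, so no Python-level loop is involved.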

Including hours, minutes, and seconds:

import pandas as pd
import numpy as np

y = np.array([2012, 2013])
m = np.array([1, 3])
d = np.array([5, 20])
H = np.array([10, 20])
M = np.array([30, 40])
S = np.array([0, 30])

start = np.array(["0000-01-01"], dtype="M8[Y]")

date = start + y.astype("m8[Y]") + (m - 1).astype("m8[M]") + (d-1).astype("m8[D]")
datetime = date.astype("M8[s]") + H.astype("m8[h]") + M.astype("m8[m]") + S.astype("m8[s]")

pd.Series(datetime)

Result:

0   2012-01-05 10:30:00
1   2013-03-20 20:40:30
dtype: datetime64[ns]
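
One property of this arithmetic that may be worth knowing: it never validates the calendar date, so a day that does not exist in the given month simply spills into the next one. A small sketch, with values chosen purely for illustration:

```python
import numpy as np

# First day of February 2013 at month resolution
month_start = np.array(["2013-02"], dtype="M8[M]")
day = np.array([30])  # 30 February does not exist

# Same (d - 1) day offset as in the answer; no error is raised
date = month_start + (day - 1).astype("m8[D]")
print(date)  # ['2013-03-02']
```

If the inputs can contain invalid day numbers, they would need to be checked separately before or after the addition.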

Answer 1: (score: 1)

Solution

import numpy as np
import pandas as pd

def build_DatetimeIndex(years, months, days):
    years = pd.Index(years, name='year')
    months = pd.Index(months, name='month')
    days = pd.Index(days, name='day')

    panel = pd.Panel(items=days, major_axis=years, minor_axis=months)

    to_dt = lambda x: pd.datetime(*x)
    series = panel.fillna(0).to_frame().stack().index.to_series()

    return pd.DatetimeIndex(series.apply(to_dt))

Demonstration

dti = build_DatetimeIndex(range(1900, 2000), range(1, 13), [1, 15])

print(dti)

DatetimeIndex(['1900-01-01', '1900-01-15', '1900-02-01', '1900-02-15',
               '1900-03-01', '1900-03-15', '1900-04-01', '1900-04-15',
               '1900-05-01', '1900-05-15',
               ...
               '1999-08-01', '1999-08-15', '1999-09-01', '1999-09-15',
               '1999-10-01', '1999-10-15', '1999-11-01', '1999-11-15',
               '1999-12-01', '1999-12-15'],
              dtype='datetime64[ns]', length=2400, freq=None)
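
Note that pd.Panel was removed in pandas 0.25, so the function above only runs on older versions. On a current pandas, the same Cartesian product of years, months, and days can be sketched with pd.MultiIndex.from_product and pd.to_datetime. This swaps in a different technique from the one in this answer, and the function name below is hypothetical.

```python
import pandas as pd

def build_datetime_index_product(years, months, days):
    # Hypothetical replacement for the Panel-based function above.
    # Every (year, month, day) combination, with the year varying slowest.
    combos = pd.MultiIndex.from_product(
        [years, months, days], names=["year", "month", "day"]
    )
    # to_datetime assembles dates from columns named year, month, day.
    return pd.DatetimeIndex(pd.to_datetime(combos.to_frame(index=False)))

dti = build_datetime_index_product(range(1900, 2000), range(1, 13), [1, 15])
print(dti)
```

The element order matches the demonstration output above: all days of a month first, then the next month, then the next year.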

Answer 2: (score: 1)

Another solution

from datetime import datetime

import pandas as pd
import numpy as np

def nao(*args):
    # Nested outer sum: packs the arguments into integers such as 19000101
    if len(args) == 1:
        return np.asarray(args[-1]).flatten()
    else:
        return np.add.outer(args[-1], nao(*args[:-1]) * 1e2).flatten()

def handler(*args):
    fmt = np.array(['%Y', '%m', '%d', '%H', '%M', '%S'])
    # Use only as many format codes as there are arguments
    fstr = "".join(fmt[:len(args)])
    ds = nao(*args).astype(np.dtype(int))
    return pd.Index(pd.Series(ds).apply(lambda x: datetime.strptime(str(x), fstr)))

Demonstration

handler(range(1900, 2000), range(1, 13), range(1, 28))

DatetimeIndex(['1900-01-01', '1901-01-01', '1902-01-01', '1903-01-01',
               '1904-01-01', '1905-01-01', '1906-01-01', '1907-01-01',
               '1908-01-01', '1909-01-01',
               ...
               '1990-12-27', '1991-12-27', '1992-12-27', '1993-12-27',
               '1994-12-27', '1995-12-27', '1996-12-27', '1997-12-27',
               '1998-12-27', '1999-12-27'],
              dtype='datetime64[ns]', length=32400, freq=None)

Timing test

from datetime import datetime

stamp = datetime.now()
for _ in range(10):
    handler(range(1900, 2000), range(1, 13), range(1, 28))
print(datetime.now() - stamp)

0:00:04.870000
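
The per-element strptime call inside apply is still a Python-level loop. As a sketch of how the same integer packing could be parsed in a single call, the packed values can be passed to pd.to_datetime with an explicit format. The name handler_vectorized is hypothetical, and this version is restricted to year, month, and day; no timing is claimed for it here.

```python
import numpy as np
import pandas as pd

def handler_vectorized(years, months, days):
    # Hypothetical variant of handler(), limited to three components.
    # indexing="ij" with days first reproduces the order of nao():
    # days vary slowest, years vary fastest.
    d, m, y = (np.ravel(a) for a in np.meshgrid(days, months, years, indexing="ij"))
    # Pack each date into one integer, e.g. 19000101
    codes = y * 10000 + m * 100 + d
    # Parse all packed values at once instead of one strptime call per element
    return pd.DatetimeIndex(pd.to_datetime(codes.astype(str), format="%Y%m%d"))

idx = handler_vectorized(range(1900, 2000), range(1, 13), range(1, 28))
print(idx)
```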

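Answer 3: (score: 1)

This is just to close the loop and give an example of the pd.to_datetime functionality that Jeff pointed out in https://github.com/pydata/pandas/pull/12967.

pd.to_datetime can be used whether or not the year, month, day, etc. already exist as columns in a DataFrame. (See the GitHub discussion for an example with existing columns.)

Following the example in the question, the DatetimeIndex below is created without any existing DataFrame columns for year, month, day, etc.: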

import numpy as np
import pandas as pd

datedict = {'year':  [2012]*4, # Length must equal 'month' and 'day' length
            'month': [1, 2, 5, 6], 
            'day':   [1, 15, 1, 15]}
pd.DatetimeIndex(pd.to_datetime(datedict))
DatetimeIndex(['2012-01-01', '2012-02-15', '2012-05-01', '2012-06-15'], 
              dtype='datetime64[ns]', freq=None)
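
Because every entry of the dict must have the same length, a scalar year has to be repeated by hand. A minimal sketch of one way to automate that with np.broadcast_arrays before calling pd.to_datetime; the helper name build_datetime_index_from_parts is made up for illustration.

```python
import numpy as np
import pandas as pd

def build_datetime_index_from_parts(**parts):
    # Hypothetical helper: keyword names must be ones pd.to_datetime
    # understands, such as year, month, day, hour, minute, second.
    arrays = np.broadcast_arrays(*parts.values())
    # Flatten each broadcast array so all entries are 1-D and equally long
    columns = {name: np.ravel(arr) for name, arr in zip(parts, arrays)}
    return pd.DatetimeIndex(pd.to_datetime(columns))

idx = build_datetime_index_from_parts(year=2012, month=[1, 2, 5, 6], day=[1, 15, 1, 15])
print(idx)
```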