Expanding a dataframe at a specific frequency - Python

Time: 2018-09-03 09:30:15

Tags: python pandas performance numpy dataframe

I have a dataframe with the following columns (excerpt only):

     START                    END               FREQ     VARIABLE    
'2017-03-26 16:55:00'  '2017-10-28 16:55:00'   1234567      x
'2017-03-26 20:35:00'  '2017-10-28 20:35:00'   1234567      y
'2017-03-26 14:55:00'  '2017-10-28 14:55:00'   ..3.567      y
'2017-03-26 11:15:00'  '2017-10-28 11:15:00'   1234567      y
'2017-03-26 09:30:00'  '2017-06-11 09:30:00'   ......7      x

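My goal is to create a new dataframe that expands this one by generating daily rows from the START date through the END date, according to the FREQ column. In FREQ, 1 = Monday and 7 = Sunday, and a dot means that no row should be created for that weekday. So ..3.5.7 yields only 3 new rows per week: Wednesday, Friday and Sunday. Every generated row should carry the same VARIABLE value as its source row.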
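
To make the FREQ encoding concrete, here is a minimal sketch that turns a FREQ string into the list of enabled weekday numbers. The helper name `parse_freq` is hypothetical and is not part of the question's code.

```python
def parse_freq(freq):
    """Return the enabled weekday numbers (1 = Monday ... 7 = Sunday)."""
    # Every non-dot character is the number of a weekday that should get a row.
    return [int(ch) for ch in freq if ch != '.']


print(parse_freq('..3.5.7'))  # [3, 5, 7] -> Wednesday, Friday, Sunday
print(parse_freq('1234567'))  # [1, 2, 3, 4, 5, 6, 7] -> every day
```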

My main concern is that the new dataframe will have millions of rows, so I am looking for a really efficient solution.

The dataframe written as Python code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([ 
'2017-03-26 16:55:00','2017-10-28 16:55:00', '1234567', 'x',
'2017-03-26 20:35:00','2017-10-28 20:35:00','1234567','y',
'2017-03-26 14:55:00','2017-10-28 14:55:00','..3.567','y',
'2017-03-26 11:15:00','2017-10-28 11:15:00','1234567','y',
'2017-03-26 09:30:00','2017-06-11 09:30:00','......7','x']).reshape((5, 4)))
df.columns = ['START','END','FREQ','VARIABLE']

2 answers:

Answer 0 (score: 1)

Here is one potential approach using DataFrame.iterrows():

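chunks = []
for _, row in df.iterrows():
    # Enabled weekdays, shifted to pandas' numbering (Monday = 0 ... Sunday = 6)
    freq = [int(ch) - 1 for ch in row['FREQ'] if ch != '.']
    # Keep only the daily timestamps that fall on an enabled weekday
    dateidx = [d for d in pd.date_range(row['START'], row['END'], freq='D') if d.weekday() in freq]
    chunks.append(pd.DataFrame({'date': dateidx}).assign(variable=row['VARIABLE']))
# Stack the per-row pieces into one dataframe
df_expanded = pd.concat(chunks, ignore_index=True)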

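Answer 1 (score: 1)

Revised answer:

This approach uses pandas iloc and numpy repeat to build the new dataframe from the original dataframe's index, after first working out from each row's date range and valid weekdays how many times its index value must be repeated.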
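
As a small illustration of the repeat-then-index idea on its own, the sketch below expands a toy dataframe. The names `df_small` and `repeats` and the repeat counts are made up for this example; in the answer the counts come from the date ranges.

```python
import numpy as np
import pandas as pd

# Toy frame; the repeat counts are arbitrary illustration values.
df_small = pd.DataFrame({'VARIABLE': ['x', 'y', 'z']})
repeats = np.array([2, 0, 3])

# np.repeat turns positions [0, 1, 2] into [0, 0, 2, 2, 2];
# iloc then selects each source row that many times in a single call.
positions = np.repeat(np.arange(len(df_small)), repeats)
expanded = df_small.iloc[positions].reset_index(drop=True)

print(expanded['VARIABLE'].tolist())  # ['x', 'x', 'z', 'z', 'z']
```

A row with a repeat count of 0 simply drops out of the result, which is what should happen when no enabled weekday falls inside a row's date range.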

import pandas as pd
import numpy as np

df_arr = np.array([ 
    '2017-03-26 16:55:00', '2017-10-28 16:55:00', '1234567', 'x',
    '2017-03-26 20:35:00', '2017-10-28 20:35:00', '1234567', 'y',
    '2017-03-26 14:55:00', '2017-10-28 14:55:00', '..3.567', 'y',
    '2017-03-26 11:15:00', '2017-10-28 11:15:00', '1234567', 'y',
    '2017-03-26 09:30:00', '2017-06-11 09:30:00', '......7', 'x'])

df = pd.DataFrame(df_arr.reshape(5, 4),
                  columns=['START', 'END', 'FREQ', 'VARIABLE'])

def get_weekdays_dates_repeats(start, end, valid_weekday_nums):
    date_range = pd.date_range(start, end, freq="D", normalize=True)
    all_day_nums = date_range.dayofweek.values + 1
    filtered_idx = np.where(np.isin(all_day_nums, valid_weekday_nums))
    day_nums = all_day_nums[filtered_idx]
    dates = date_range[filtered_idx]
    return day_nums, dates.values.astype('<M8[D]'), day_nums.size

starts = df.START.values
ends = df.END.values
# regex=False: treat '.' as a literal dot rather than a regex wildcard
freqs = df.FREQ.str.replace('.', '', regex=False).values

repeats = np.zeros(len(df))
weekdays_arr_list = []
dates_arr_list = []
for i in range(len(df)):
    valid_day_nums = [int(s) for s in list(freqs[i])]
    days, dates, repeat = \
        get_weekdays_dates_repeats(starts[i], ends[i], valid_day_nums)
    weekdays_arr_list.append(days)
    dates_arr_list.append(dates)
    repeats[i] = repeat

weekday_col = np.concatenate(weekdays_arr_list)
dates_col = np.concatenate(dates_arr_list)
repeats = repeats.astype(int)

df2 = df.iloc[np.repeat(df.index.values, repeats)].reset_index(drop=True)

df2['day_num'] = weekday_col
df2['date'] = dates_col

df2.head()

                 START                  END     FREQ VARIABLE  day_num        date
0  2017-03-26 16:55:00  2017-10-28 16:55:00  1234567        x        7  2017-03-26
1  2017-03-26 16:55:00  2017-10-28 16:55:00  1234567        x        1  2017-03-27
2  2017-03-26 16:55:00  2017-10-28 16:55:00  1234567        x        2  2017-03-28
3  2017-03-26 16:55:00  2017-10-28 16:55:00  1234567        x        3  2017-03-29
4  2017-03-26 16:55:00  2017-10-28 16:55:00  1234567        x        4  2017-03-30

df2.tail()

                   START                  END     FREQ VARIABLE  day_num        date
782  2017-03-26 09:30:00  2017-06-11 09:30:00  ......7        x        7  2017-05-14
783  2017-03-26 09:30:00  2017-06-11 09:30:00  ......7        x        7  2017-05-21
784  2017-03-26 09:30:00  2017-06-11 09:30:00  ......7        x        7  2017-05-28
785  2017-03-26 09:30:00  2017-06-11 09:30:00  ......7        x        7  2017-06-04
786  2017-03-26 09:30:00  2017-06-11 09:30:00  ......7        x        7  2017-06-11
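
As a sanity check on the size of the result (the last index above is 786, so 787 rows), the expected row count per source row can be computed without building any dates. This is only a sketch and not part of the original answer: it assumes numpy's `np.busday_count` with a custom `weekmask`, and the helper names `freq_to_weekmask` and `expected_rows` are hypothetical.

```python
import numpy as np

# (START date, END date, FREQ) for the five sample rows; times are dropped
# because only whole days matter for counting.
rows = [
    ('2017-03-26', '2017-10-28', '1234567'),
    ('2017-03-26', '2017-10-28', '1234567'),
    ('2017-03-26', '2017-10-28', '..3.567'),
    ('2017-03-26', '2017-10-28', '1234567'),
    ('2017-03-26', '2017-06-11', '......7'),
]


def freq_to_weekmask(freq):
    """Map e.g. '..3.567' to the Monday-first mask '0010111' numpy expects."""
    return ''.join('0' if ch == '.' else '1' for ch in freq)


def expected_rows(start, end, freq):
    """Count enabled weekdays in the closed range [start, end]."""
    begin = np.datetime64(start, 'D')
    # busday_count excludes the end date, so move it one day forward.
    stop = np.datetime64(end, 'D') + np.timedelta64(1, 'D')
    return int(np.busday_count(begin, stop, weekmask=freq_to_weekmask(freq)))


counts = [expected_rows(*r) for r in rows]
print(counts, sum(counts))  # [217, 217, 124, 217, 12] 787
```

The same counts could also serve as the `repeats` array when only the expanded VARIABLE column is needed and the individual dates are not.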