有效地向数据帧添加行

时间:2018-04-01 20:13:41

标签: python-3.x performance pandas dataframe

我有一个数据集,其中包含特定数量的客户ID的历史定价数据。基本上,它是一个电子表格,有两列作为主键(id,p_date): enter image description here 请在下面找到数据集的片段(前4行):

{'id': {0: '038af19179925da21a25619c5a24b745',
1: '038af19179925da21a25619c5a24b745',
2: '038af19179925da21a25619c5a24b745',
3: '038af19179925da21a25619c5a24b745',
4: '038af19179925da21a25619c5a24b745'},
'p_date': {0: '2015-01-01',
1: '2015-02-01',
2: '2015-03-01',
3: '2015-04-01',
4: '2015-05-01'},
'p_p1_fix': {0: 44.26693116,
1: 44.26693116,
2: 44.26693116,
3: 44.26693116,
4: 44.26693116},
'p_p1_var': {0: 0.151367,
1: 0.151367,
2: 0.151367,
3: 0.149626,
4: 0.149626},
'p_p2_fix': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p2_var': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p3_fix': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p3_var': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}}

每个客户端ID最多有12个数据点,对应于一年中的12个月。我想将各个定价数据点用作分类模型中的单独特征,因此需要将它们分成不同的列。如何最有效地做到这一点?电子表格包含+ 20K用户的数据。我写了下面的脚本,花了一个多小时但还没有完成:

def transform_hist (hist):
  dates = pd.Series(['2015-01-01',
                         '2015-02-01',
                         '2015-03-01',
                         '2015-04-01',
                         '2015-05-01',
                         '2015-06-01',
                         '2015-07-01',
                         '2015-08-01',
                         '2015-09-01',
                         '2015-10-01',
                         '2015-11-01',
                         '2015-12-01',])
  ids = hist['id'].unique()
  rows_list = []
  for curr_id in ids:
    for curr_date in dates:
        temp = hist[(hist.id == curr_id) & (hist.p_date == curr_date)]
        if len(temp > 0):
            rows_list.append({'id':curr_id, 
                         'p1_var_'+curr_date[5:7]:temp['p_p1_var'],
                         'p2_var_'+curr_date[5:7]:temp['p_p2_var'],
                         'p3_var_'+curr_date[5:7]:temp['p_p3_var'],
                         'p1_fix_'+curr_date[5:7]:temp['p_p1_fix'],
                         'p2_fix_'+curr_date[5:7]:temp['p_p2_fix'],
                         'p3_fix_'+curr_date[5:7]:temp['p_p3_fix']
                        })
  df = pd.DataFrame(rows_list)
  flat_df = df.groupby(['id']).sum()
  return flat_df.reset_index()

预期输出将是一个包含73列的数据帧(id + 12 * 6列,每列存储12个数据点中每个数据点的6个价格之一),每个id有1行(目前每行最多12行) ID)。我在下面添加了上面数据框片段的预期输出示例:

{'id': {0: '0002203ffbb812588b632b9e628cc38d'},
'p_p1_fix_01': {0: 40.56596939999999},
'p_p1_fix_02': {0: 40.56596939999999},
'p_p1_fix_03': {0: 40.728885},
'p_p1_fix_04': {0: 40.728885},
'p_p1_fix_05': {0: nan},
'p_p1_fix_06': {0: nan},
'p_p1_fix_07': {0: nan},
'p_p1_fix_08': {0: nan},
'p_p1_fix_09': {0: nan},
'p_p1_fix_10': {0: nan},
'p_p1_fix_11': {0: nan},
'p_p1_fix_12': {0: nan},
'p_p1_var_01': {0: 0.12609800000000002},
'p_p1_var_02': {0: 0.12609800000000002},
'p_p1_var_03': {0: 0.12806700000000001},
'p_p1_var_04': {0: 0.12806700000000001},
'p_p1_var_05': {0: nan},
'p_p1_var_06': {0: nan},
'p_p1_var_07': {0: nan},
'p_p1_var_08': {0: nan},
'p_p1_var_09': {0: nan},
'p_p1_var_10': {0: nan},
'p_p1_var_11': {0: nan},
'p_p1_var_12': {0: nan},
'p_p2_fix_01': {0: 24.33958068},
'p_p2_fix_02': {0: 24.33958068},
'p_p2_fix_03': {0: 24.43733004},
'p_p2_fix_04': {0: 24.43733004},
'p_p2_fix_05': {0: nan},
'p_p2_fix_06': {0: nan},
'p_p2_fix_07': {0: nan},
'p_p2_fix_08': {0: nan},
'p_p2_fix_09': {0: nan},
'p_p2_fix_10': {0: nan},
'p_p2_fix_11': {0: nan},
'p_p2_fix_12': {0: nan},
'p_p2_var_01': {0: 0.103975},
'p_p2_var_02': {0: 0.103975},
'p_p2_var_03': {0: 0.105842},
'p_p2_var_04': {0: 0.105842},
'p_p2_var_05': {0: nan},
'p_p2_var_06': {0: nan},
'p_p2_var_07': {0: nan},
'p_p2_var_08': {0: nan},
'p_p2_var_09': {0: nan},
'p_p2_var_10': {0: nan},
'p_p2_var_11': {0: nan},
'p_p2_var_12': {0: nan},
'p_p3_fix_01': {0: 16.22638872},
'p_p3_fix_02': {0: 16.22638872},
'p_p3_fix_03': {0: 16.29155496},
'p_p3_fix_04': {0: 16.29155496},
'p_p3_fix_05': {0: nan},
'p_p3_fix_06': {0: nan},
'p_p3_fix_07': {0: nan},
'p_p3_fix_08': {0: nan},
'p_p3_fix_09': {0: nan},
'p_p3_fix_10': {0: nan},
'p_p3_fix_11': {0: nan},
'p_p3_fix_12': {0: nan},
'p_p3_var_01': {0: 0.070232},
'p_p3_var_02': {0: 0.070232},
'p_p3_var_03': {0: 0.073773},
'p_p3_var_04': {0: 0.073773},
'p_p3_var_05': {0: nan},
'p_p3_var_06': {0: nan},
'p_p3_var_07': {0: nan},
'p_p3_var_08': {0: nan},
'p_p3_var_09': {0: nan},
'p_p3_var_10': {0: nan},
'p_p3_var_11': {0: nan},
'p_p3_var_12': {0: nan}}

如何重写脚本以便更快地完成工作?

2 个答案:

答案 0 :(得分:1)

幸运的是,这可以非常简单地完成:

df = df.pivot(index='id', columns='p_date')

这将使用列的分层索引重塑df。要将层次结构级别展平为一个,您可以使用找到的解决方案here

df.columns = [' '.join(col).strip() for col in df.columns.values]

将列名设置为上一步中存在的每个层级的各自名称的连接。现在,每个日期组合与其他值列都有一列。

答案 1 :(得分:0)

如果我清楚了解,您正在寻找filter + groupby

hist['p_date']=pd.to_datetime(hist['p_date'])
s=hist.loc[hist['p_date'].isin(dates),:]
s.groupby([s['id'],s['p_date'].dt.month]).sum().reset_index()