Iterating over a list of DataFrames

Time: 2018-07-30 11:00:00

Tags: python list pandas pivot-table

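I have the following problem. I have a DataFrame with 3 columns: the first is UserID, the second is InvoiceType, and the third is CreateTime, the time the invoice was created.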

import pandas as pd

df = pd.read_csv('invoice.csv')

Output:
UserID  InvoiceType  CreateTime
1       a            2018-01-01 12:31:00
2       b            2018-01-01 12:34:12
3       a            2018-01-01 12:40:13
1       c            2018-01-09 14:12:25
2       a            2018-01-12 14:12:29
1       b            2018-02-08 11:15:00
2       c            2018-02-12 10:12:12
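For anyone who wants to reproduce this without invoice.csv, the sample rows above can be inlined. This is only a sketch: the csv_text variable and the use of parse_dates are assumptions, since the question does not show how CreateTime was parsed. Datetime values are needed for the timestamp subtraction used further down.

```python
import io

import pandas as pd

# Same rows as the sample output above, inlined so no invoice.csv is needed.
csv_text = """UserID,InvoiceType,CreateTime
1,a,2018-01-01 12:31:00
2,b,2018-01-01 12:34:12
3,a,2018-01-01 12:40:13
1,c,2018-01-09 14:12:25
2,a,2018-01-12 14:12:29
1,b,2018-02-08 11:15:00
2,c,2018-02-12 10:12:12
"""

# parse_dates (an assumption here) converts CreateTime to datetimes,
# so two timestamps can be subtracted to get a timedelta.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["CreateTime"])
print(df.dtypes)
```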

I am trying to plot the invoice cycle of each user. I need to create 2 new columns, time_diff and time_diff_wrt_first_invoice. time_diff represents the time difference between consecutive invoices of each user, while time_diff_wrt_first_invoice represents the time difference between every invoice and the first one, which is interesting for plotting. Here is my code:

import math

import pandas as pd


# Explode a column whose cells each hold a list into one row per element
def explode_list(df, x):
    return (df[x].apply(pd.Series)
            .stack()
            .reset_index(level=1, drop=True)
            .to_frame(x))


# Apply explode_list to every column and join the results side by side
def explode_listDF(df):
    exploded_df = pd.DataFrame()

    for x in df.columns.tolist():
        exploded_df = pd.concat([exploded_df, explode_list(df, x)], axis=1)

    return exploded_df


# Build the time difference column in pivot table format
def pivoted_diffTime(df, _freq=60):

    # _freq is 1 for minutes frequency
    # _freq is 60 for hour frequency
    # _freq is 60*24 for daily frequency
    # _freq is 60*24*30 for monthly frequency (30-day months)

    df = df.sort_values(['UserID', 'CreateTime'])

    # One row per user; every cell holds the list of that user's values
    df_pivot = df.pivot_table(index='UserID',
                              aggfunc=lambda x: list(v for v in x))

    df_pivot['time_diff'] = [[0]] * len(df_pivot)

    for user in df_pivot.index:

        try:
            times = df_pivot.loc[user, 'CreateTime']
            _list = [0] + [math.floor((x - y).total_seconds() / (60 * _freq))
                           for x, y in zip(times[1:], times[:-1])]

            # .at stores the whole list in a single cell
            df_pivot.at[user, 'time_diff'] = _list

        except Exception:
            print('There is a prob here :', user)

    return df_pivot


# Pipeline the two functions to obtain an exploded dataframe
# with time difference
def get_timeDiff(df, _frequency):

    df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))

    return df

Once time_diff is available, time_diff_wrt_first_invoice can be created as follows:

import numpy as np

# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] = [[0]]*len(df_with_timeDiff)

# Then we loop over users and we apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
    mask = df_with_timeDiff.UserID == user
    df_with_timeDiff.loc[mask, 'time_diff_wrt_first_invoice'] = np.cumsum(df_with_timeDiff.loc[mask, 'time_diff'])

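The problem is that my DataFrame has thousands of users, and this approach is very time-consuming. I would like to know whether there is a solution better suited to what I need.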

2 Answers:

Answer 0: (score: 0)

Check out .loc[] in pandas.

    df_1 = pd.DataFrame(some_stuff)

    df_2 = df_1.loc[df_1['column'] >= some_condition, 'specific_column']

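You can access specific columns and check conditions without running a loop: if you add a comma after the condition and give a column name, only that column is returned. I'm not 100% sure this answers what you are asking, because I don't actually see a question, but you seem to be running a lot of for loops and other code to isolate columns, and that is what .loc[] is for.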
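A minimal sketch of the pattern described above, using made-up data (the df_1 contents and the types_user_1 name are illustrative, not taken from the question):

```python
import pandas as pd

df_1 = pd.DataFrame({
    "UserID": [1, 2, 3, 1],
    "InvoiceType": ["a", "b", "a", "c"],
})

# The boolean mask before the comma selects rows;
# the label after the comma selects a single column.
types_user_1 = df_1.loc[df_1["UserID"] == 1, "InvoiceType"]
print(types_user_1.tolist())  # ['a', 'c']
```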

Answer 1: (score: 0)

I found a better solution. Here is my code:

import numpy as np
import pandas as pd


def next_diff(x):
    # Hours between consecutive timestamps; the first invoice gets 0
    return [0] + [(b - a).total_seconds() / 3600 for b, a in zip(x[1:], x[:-1])]


def create_timediff(df):

    # Note: this sorts the caller's dataframe in place
    df.sort_values(['UserID', 'CreateTime'], inplace=True)
    a = df.groupby('UserID').agg({'CreateTime': lambda x: list(v for v in x)}).CreateTime.apply(next_diff)
    b = a.apply(np.cumsum)

    a = a.reset_index()
    b = b.reset_index()

    # Here I explode the lists inside the cell
    rows1 = []
    _ = a.apply(lambda row: [rows1.append([row['UserID'], nn])
                             for nn in row.CreateTime], axis=1)
    rows2 = []
    __ = b.apply(lambda row: [rows2.append([row['UserID'], nn])
                              for nn in row.CreateTime], axis=1)

    df1_new = pd.DataFrame(rows1, columns=a.columns).set_index(['UserID'])
    df2_new = pd.DataFrame(rows2, columns=b.columns).set_index(['UserID'])

    # All three frames are ordered by UserID, so their indexes line up row by row
    df = df.set_index('UserID')
    df['time_diff'] = df1_new['CreateTime']
    df['time_diff_wrt_first_invoice'] = df2_new['CreateTime']
    df.reset_index(inplace=True)

    return df
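
As a possible alternative to exploding lists, the same two columns could probably be computed with grouped operations only. This is a sketch under assumptions, not the answerer's method: the function name add_time_diffs and the seconds_per_unit parameter are hypothetical, and it relies on time_diff_wrt_first_invoice being the gap between each invoice and that user's earliest one, as described in the question.

```python
import pandas as pd


def add_time_diffs(df, seconds_per_unit=3600):
    """Return a sorted copy with time_diff and time_diff_wrt_first_invoice."""
    out = df.copy()
    out["CreateTime"] = pd.to_datetime(out["CreateTime"])
    out = out.sort_values(["UserID", "CreateTime"]).reset_index(drop=True)

    grouped = out.groupby("UserID")["CreateTime"]

    # Gap to the same user's previous invoice; a first invoice has none, so 0.
    out["time_diff"] = (
        grouped.diff().dt.total_seconds().fillna(0) / seconds_per_unit
    )

    # Gap to the same user's earliest invoice (the running sum of time_diff).
    first = grouped.transform("min")
    out["time_diff_wrt_first_invoice"] = (
        (out["CreateTime"] - first).dt.total_seconds() / seconds_per_unit
    )
    return out


# Sample rows copied from the question.
sample = pd.DataFrame({
    "UserID": [1, 2, 3, 1, 2, 1, 2],
    "InvoiceType": list("abacabc"),
    "CreateTime": pd.to_datetime([
        "2018-01-01 12:31:00", "2018-01-01 12:34:12", "2018-01-01 12:40:13",
        "2018-01-09 14:12:25", "2018-01-12 14:12:29", "2018-02-08 11:15:00",
        "2018-02-12 10:12:12",
    ]),
})
result = add_time_diffs(sample)
print(result)
```

With seconds_per_unit=3600 the values are in hours, matching next_diff above. There is no Python-level loop over users, which is usually what matters with thousands of users.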