如何在熊猫数据框中进行复杂的计算

时间:2021-07-27 07:55:50

标签: python pandas dataframe numpy

示例数据框:

df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
                   '2020-01': [24,42,18,68,24,30],
                   '2020-02': [24,42,18,68,24,30],
                   '2020-03': [64,24,70,70,88,57],
                   '2020-04': [22,11,44,3,5,78],
                   '2020-05': [11,35,74,12,69,51]}

我想在下面找到df['L2']

我学过pandas roll、groupby等,解决不了。

请阅读L2公式并给我意见

L2 公式

L2(Jan-20) = 24
-------------------
     sales  2020-01
0  2020-01       24
-------------------

L2(Feb-20) = 132  (sum of below matrix 2x2)

     sales  2020-01  2020-02
0  2020-01       24       24
1  2020-02       42       42
-------------------
L2(Mar-20) = 154 (sum of matrix 2x2)

     sales  2020-02  2020-03
0  2020-02       42       24
1  2020-03       18       70
-------------------
L2(Apr-20) = 187 (sum of below maxtrix 2x2)

     sales  2020-03  2020-04
0  2020-03       70       44
1  2020-04       70        3

输出

   Unnamed: 0   sales  Jan-20  Feb-20  Mar-20  Apr-20  May-20   L2   L3
0           0  Jan-20      24      24      64      22      11   24   24
1           1  Feb-20      42      42      24      11      35  132  132
2           2  Mar-20      18      18      70      44      74  154  326
3           3  Apr-20      68      68      70       3      12  187  350
4           4  May-20      24      24      88       5      69   89  545
5           5  Jun-20      30      30      57      78      51  203  433

3 个答案:

答案 0 :(得分:0)

import pandas as pd
import numpy as np


# make a dataset

df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
                   '2020-01': [24,42,18,68,24,30],
                   '2020-02': [24,42,18,68,24,30],
                   '2020-03': [64,24,70,70,88,57],
                   '2020-04': [22,11,44,3,5,78],
                   '2020-05': [11,35,74,12,69,51]})

print(df)

# datawork(L2)

for i in range(0,df.shape[0]):
    
    if i==0:
        df.loc[i,'L2']=df.loc[i,'2020-01']
    
    else:
        if i!=df.shape[0]-1:
            df.loc[i,'L2']=df.iloc[i-1:i+1,i:i+2].sum().sum()
            
        if i==df.shape[0]-1:
             df.loc[i,'L2']=df.iloc[i-1:i+1,i-1:i+1].sum().sum()

print(df)

#     sales  2020-01  2020-02  2020-03  2020-04  2020-05     L2
#0  2020-01       24       24       64       22       11   24.0
#1  2020-02       42       42       24       11       35  132.0
#2  2020-03       18       18       70       44       74  154.0
#3  2020-04       68       68       70        3       12  187.0
#4  2020-05       24       24       88        5       69   89.0
#5  2020-06       30       30       57       78       51  203.0

答案 1 :(得分:0)

Values=f.values[:,1:]
L2=[]
RANGE=Values.shape[0]
for a in range(RANGE):
    if a==0:
        result=Values[a,a]
    else:
        if Values[a-1:a+1,a-1:a+1].shape==(2,1):
            result=np.sum(Values[a-1:a+1,a-2:a])
        
        else:
            result=np.sum(Values[a-1:a+1,a-1:a+1])
         
    L2.append(result)
print(L2)
L2 output:-->[24, 132, 154, 187, 89, 203]

f["L2"]=L2

f: enter image description here

答案 2 :(得分:0)

我尝试了另一种方法。 这个方法使用了reshape long(在python中:melt),但是我在python中应用了reshape long两次,因为df中销售和其他列的时间频率是每月而不是每天,所以我再一次reshape long使int列对应于每月日期。 (我用Stata的次数比python多,在Stata中,我只能做一次reshape很长时间,因为它有每月的时间频率,而且reshape任务比pandas,python要容易得多)

有兴趣的可以看看

# 00.module

import pandas as pd
import numpy as np
from order import order # https://stackoverflow.com/a/68464246/16478699

# 0.make a dataset

df = pd.DataFrame({'sales': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'],
                   '2020-01': [24, 42, 18, 68, 24, 30],
                   '2020-02': [24, 42, 18, 68, 24, 30],
                   '2020-03': [64, 24, 70, 70, 88, 57],
                   '2020-04': [22, 11, 44, 3, 5, 78],
                   '2020-05': [11, 35, 74, 12, 69, 51]}
                  )
df.to_stata('dataset.dta', version=119, write_index=False)
print(df)

# 1.reshape long(in python: melt)

t = list(df.columns)
t.remove('sales')
df_long = df.melt(id_vars='sales', value_vars=t, var_name='var', value_name='val')
df_long['id'] = list(range(1, df_long.shape[0] + 1))  # make id for another resape long
print(df_long)

# 2.another reshape long(in python: melt, reason: make int(col name: tid) corresponding to monthly date of sales and monthly columns in df)

df_long2 = df_long.melt(id_vars=['id', 'val'], value_vars=['sales', 'var'])
df_long2['tid'] = df_long2['value'].apply(lambda x: 1 + list(df_long2.value.unique()).index(x))
print(df_long2)

# 3.back to wide form with tid(in python: pd.pivot)

df_wide = pd.pivot(df_long2, index=['id', 'val'], columns='variable', values=['value', 'tid'])
df_wide.columns = df_wide.columns.map(lambda x: x[1] if x[0] == 'value' else f'{x[0]}_{x[1]}')  # change multiindex columns name into just normal columns name
df_wide = df_wide.reset_index()
print(df_wide)

# 4.make values of L2

for i in df_wide.tid_sales.unique():

    if list(df_wide.tid_sales.unique()).index(i) + 1 == len(df_wide.tid_sales.unique()):
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[(((df_wide['tid_sales'] == i) | (
                    df_wide['tid_sales'] == i - 1)) & ((df_wide['tid_var'] == i - 1) | (
                    df_wide['tid_var'] == i - 2))), 'val'].sum()

    else:
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[(((df_wide['tid_sales'] == i) | (
                    df_wide['tid_sales'] == i - 1)) & ((df_wide['tid_var'] == i) | (
                    df_wide['tid_var'] == i - 1))), 'val'].sum()

print(df_wide)

# 5.back to shape of df with L2(reshape wide, in python: pd.pivot)

df_final = df_wide.drop(columns=df.filter(regex='^tid'))  # no more columns starting with tid needed
df_final = pd.pivot(df_final, index=['sales', 'L2'], columns='var', values='val').reset_index()
df_final = order(df_final, 'L2', f_or_l='last')  # order function is made by me
print(df_final)
相关问题