Data
I have a dataset showing the latest booking data grouped by company and month (empty values are NaNs):
company month year_ly bookings_ly year_ty bookings_ty
company a 1 2018 432 2019 253
company a 2 2018 265 2019 635
company a 3 2018 345 2019 525
company a 4 2018 233 2019
company a 5 2018 7664 2019
... ... ... ... ... ...
company a 12 2018 224 2019 321
company b 1 2018 543 2019 576
company b 2 2018 23 2019 43
company b 3 2018 64 2019 156
company b 4 2018 143 2019
company b 5 2018 41 2019
company b 6 2018 90 2019
... ... ... ... ... ...
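What I want
I'd like to create a new column, or update the bookings_ty column where the value is NaN (whichever is easier), by applying the following calculation to each row (grouped by company):
((SUM of previous 3 rows (or months) of bookings_ty)
/(SUM of previous 3 rows (or months) of bookings_ly))
* bookings_ly
If bookings_ty is NaN for one of those previous rows, I'd like that iteration of the formula to use the newly calculated value in its place, so in effect the formula fills in the NaN values in bookings_ty.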
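As a quick check of the formula, here is a minimal worked example in plain Python using the company a rows from the table above. The variable names are made up for illustration; month 5 reuses the value projected for month 4, which is the recursive behaviour being asked for.

```python
# Values for company a, months 1-3, taken from the table above.
bookings_ty_prev = [253, 635, 525]
bookings_ly_prev = [432, 265, 345]
bookings_ly_month4 = 233
bookings_ly_month5 = 7664

# Month 4: ratio of the two 3-month sums, scaled by last year's bookings.
ratio4 = sum(bookings_ty_prev) / sum(bookings_ly_prev)  # 1413 / 1042
projected_month4 = ratio4 * bookings_ly_month4
print(round(projected_month4, 2))  # 315.96

# Month 5: the window is now months 2-4, and month 4 is the projected value.
ratio5 = (635 + 525 + projected_month4) / (265 + 345 + 233)
projected_month5 = ratio5 * bookings_ly_month5  # about 13418.4
```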
My attempt
df_bkgs.set_index(['operator', 'month'], inplace=True)
def calc(df_bkgs):
    df_bkgs['bookings_calc'] = df_bkgs['bookings_ty'].copy()
    df_bkgs['bookings_ty_l3m'] = df_bkgs.groupby(level=0)['bookings_ty'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_ly_l3m'] = df_bkgs.groupby(level=0)['bookings_ly'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3))
    df_bkgs['bookings_factor'] = df_bkgs['bookings_ty_l3m'] / df_bkgs['bookings_ly_l3m']
    df_bkgs['bookings_calc'] = df_bkgs['bookings_factor'] * df_bkgs['bookings_ly']
    return df_bkgs
df_bkgs.groupby(level=0).apply(calc)
import numpy as np
df['bookings_calc'] = np.where(df['bookings_ty'].isna(), df['bookings_calc'], df['bookings_ty'])
The problem with this code is that it generates the calculated field only for the first empty/NaN bookings_ty. What I'd like is an iterative or loop-type process that takes the previous 3 rows in the group and, if bookings_ty is empty/NaN, uses the calculated field for that row instead.
Thanks
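A small sketch of why a shift-based sum fills only the first gap, on a toy Series rather than the real data: the three-row sum for the second NaN row includes the first NaN row, and NaN propagates through addition.

```python
import numpy as np
import pandas as pd

# Toy series: three known values followed by two gaps.
ty = pd.Series([253.0, 635.0, 525.0, np.nan, np.nan])

# Same rolling sum as in the attempt: the three previous rows.
l3m = ty.shift(1) + ty.shift(2) + ty.shift(3)

print(l3m.iloc[3])  # 1413.0 -> rows 0-2 are all known
print(l3m.iloc[4])  # nan    -> row 3 is still NaN, so the whole sum is NaN
```

Because the shifted sums are computed in one vectorized pass, the value calculated for row 3 is never fed back in before row 4 is evaluated.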
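Answer 0: (score: 0)
You can try this. I made a function that finds the last 3 records for each row in the dataframe. Note that I had to create a column called index, because you can't access the index inside an apply statement (as far as I know).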
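Answer 1: (score: 0)
Depending on how many companies you have in your table, I might be inclined to do this in Excel rather than pandas. Iterating over rows can be slow, but if speed is not a concern, the following solution works: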
import numpy as np
import pandas as pd
df = pd.read_excel('data_file.xlsx')  # <-- name of your file.
companies = pd.unique(df.company)
months = pd.unique(df.month)
for c in companies:
    for m in months:
        # slice a single row
        df_row = df[(df['company'] == c) & (df['month'] == m)]
        if df_row.empty:
            # this company has no row for this month
            continue
        val = df_row.bookings_ty.values[0]
        if np.isnan(val):
            # get the index of the row
            idx = df_row.index[0]
            # rows for the three previous months of the same company
            df1 = df[(df['company'] == c) & (df['month'].isin(list(range(m - 3, m))))]
            ratio = df1.bookings_ty.sum() / df1.bookings_ly.sum()
            projected_value = df_row.bookings_ly.values[0] * ratio
            df.loc[idx, 'bookings_ty'] = projected_value
print(df)
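If we can assume the DataFrame is always sorted by 'company' and then by 'month', we can use the following approach instead. On my example with 24 rows of data it was about 20 times faster (0.003s vs 0.07s).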
df = pd.read_excel('data_file.xlsx')  # your input file
ly = df.bookings_ly.values.tolist()
ty = df.bookings_ty.values.tolist()
for idx, val in enumerate(ty):
    if np.isnan(val):
        # assumes the three preceding rows belong to the same company
        ratio = sum(ty[idx-3:idx]) / sum(ly[idx-3:idx])
        ty[idx] = ratio * ly[idx]
df['bookings_ty'] = ty
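The list-based loop above looks at the three preceding rows regardless of company. If a gap can occur in the first rows of a company, one possible variant is to clamp the window at the company boundary. This is only a sketch in plain Python: fill_projected is a hypothetical helper, not part of pandas, and it assumes the rows are already sorted by company and then month.

```python
import math

def fill_projected(companies, ly, ty, window=3):
    """Return a copy of ty with NaN gaps filled per company.

    Hypothetical helper: rows must be sorted by company, then month.
    """
    filled = list(ty)
    group_start = 0
    for i, value in enumerate(filled):
        if i > 0 and companies[i] != companies[i - 1]:
            group_start = i  # a new company begins here
        if not math.isnan(value):
            continue
        lo = max(group_start, i - window)  # never reach into the previous company
        denom = sum(ly[lo:i])
        if lo == i or denom == 0:
            continue  # nothing to project from; leave the NaN in place
        # earlier gaps are already filled, so they feed into this sum
        filled[i] = sum(filled[lo:i]) / denom * ly[i]
    return filled

nan = float("nan")
companies = ["a"] * 5 + ["b"] * 6
ly = [432, 265, 345, 233, 7664, 543, 23, 64, 143, 41, 90]
ty = [253, 635, 525, nan, nan, 576, 43, 156, nan, nan, nan]
result = fill_projected(companies, ly, ty)
print([int(v) for v in result])
```

With a DataFrame, the three lists could come from df['company'].tolist(), df['bookings_ly'].tolist() and df['bookings_ty'].tolist(), and the returned list can be assigned back to df['bookings_ty'].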
Answer 2: (score: 0)
Here is a solution:
import numpy as np
import pandas as pd
# sort values if not already sorted
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)
def process(x):
    # each pass fills the NaN rows whose three previous rows are all known
    while x['bookings_ty'].isnull().any():
        x['bookings_ty'] = np.where((x['bookings_ty'].isnull()),
                                    (x['bookings_ty'].shift(1) +
                                     x['bookings_ty'].shift(2) +
                                     x['bookings_ty'].shift(3)) /
                                    (x['bookings_ly'].shift(1) +
                                     x['bookings_ly'].shift(2) +
                                     x['bookings_ly'].shift(3)) *
                                    x['bookings_ly'], x['bookings_ty'])
    return x
df = df.groupby(['company']).apply(lambda x: process(x))
# convert to int64 if needed or stay with float values
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
Initial DF:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
Result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 525
3 company_a 4 2018 233 2019 315 **
4 company_a 5 2018 7664 2019 13418 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
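Note that the integers in the result come from astype(np.int64), which drops the fractional part rather than rounding. A short sketch with approximate projected values (computed by hand from the table, so treat them as illustrative):

```python
import numpy as np

# Approximate projections for company_a month 4 and 5, and company_b month 4.
projected = np.array([315.9587, 13418.4433, 175.9127])

truncated = projected.astype(np.int64)          # drops the fractional part
rounded = np.rint(projected).astype(np.int64)   # rounds to nearest first

print(truncated.tolist())  # [315, 13418, 175]
print(rounded.tolist())    # [316, 13418, 176]
```

If rounded figures are preferred, rounding before the conversion is one option; keeping the float values avoids the question altogether.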
If you want a different rolling window, or NaN values may exist at the start of a company's rows, you could use this generic solution:
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)
def process(x, m):
    idx = (x.loc[x['bookings_ty'].isnull()].index.to_list())
    for i in idx:
        # position of the row inside its company group
        pos = i - x.index[0]
        start = 0 if pos < m else pos - m
        sum_ty = sum(x['bookings_ty'].to_list()[start:pos])
        sum_ly = sum(x['bookings_ly'].to_list()[start:pos])
        ly = x.at[i, 'bookings_ly']
        x.at[i, 'bookings_ty'] = sum_ty / sum_ly * ly
    return x
rolling_month = 3
df = df.groupby(['company']).apply(lambda x: process(x, rolling_month))
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)
Initial df:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253.0
1 company_a 2 2018 265 2019 635.0
2 company_a 3 2018 345 2019 NaN
3 company_a 4 2018 233 2019 NaN
4 company_a 5 2018 7664 2019 NaN
5 company_a 12 2018 224 2019 321.0
6 company_b 1 2018 543 2019 576.0
7 company_b 2 2018 23 2019 43.0
8 company_b 3 2018 64 2019 156.0
9 company_b 4 2018 143 2019 NaN
10 company_b 5 2018 41 2019 NaN
11 company_b 6 2018 90 2019 NaN
Final result:
company month year_ly bookings_ly year_ty bookings_ty
0 company_a 1 2018 432 2019 253
1 company_a 2 2018 265 2019 635
2 company_a 3 2018 345 2019 439 ** uses only the 2 previous rows
3 company_a 4 2018 233 2019 296 **
4 company_a 5 2018 7664 2019 12467 **
5 company_a 12 2018 224 2019 321
6 company_b 1 2018 543 2019 576
7 company_b 2 2018 23 2019 43
8 company_b 3 2018 64 2019 156
9 company_b 4 2018 143 2019 175 **
10 company_b 5 2018 41 2019 66 **
11 company_b 6 2018 90 2019 144 **
If you want to speed up the process, you could try:
df.set_index(['company'], inplace=True)
df = df.groupby(level=0).apply(lambda x: process(x))
instead of
df = df.groupby(['company']).apply(lambda x: process(x))