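I have a large DataFrame in which I compute a mean under a condition. I need to replace each NaN with the last valid value for that city.
I tried df['Mean3big'].fillna(method='ffill', inplace=True), but since that does not take the city into account, I get the wrong values.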
import pandas as pd

# Note: Year in the first row is the string "2018", not the integer 2018,
# so groupby puts that row in a group of its own.
df = pd.DataFrame([["Gothenburg", "2018", 1.5, 2.3, 107],
                   ["Gothenburg", 2018, 1.3, 3.3, 10],
                   ["Gothenburg", 2018, 2.2, 2.3, 20],
                   ["Gothenburg", 2018, 1.5, 2.1, 30],
                   ["Gothenburg", 2018, 2.5, 2.3, 20],
                   ["Malmo", 2018, 1.6, 2.3, 10],
                   ["Gothenburg", 2018, 1.9, 2.8, 10],
                   ["Malmo", 2018, 0.7, 4.3, 30],
                   ["Gothenburg", 2018, 1.7, 3.2, 40],
                   ["Malmo", 2018, 1.0, 3.3, 40],
                   ["Gothenburg", 2018, 3.7, 2.3, 10],
                   ["Malmo", 2018, 1.0, 2.9, 112],
                   ["Gothenburg", 2018, 2.7, 2.3, 20],
                   ["Gothenburg", 2019, 1.3, 3.3, 10],
                   ["Gothenburg", 2019, 1.2, 2.3, 20],
                   ["Gothenburg", 2019, 1.6, 2.1, 10],
                   ["Gothenburg", 2019, 1.8, 2.3, 10],
                   ["Malmo", 2019, 1.6, 1.3, 20],
                   ["Gothenburg", 2019, 1.9, 2.8, 30]])
df.columns = ['City', 'Year', 'Val1', 'Val2', 'Val3']
df["Mean3big"] = round(df.groupby(['City', "Year"])['Val3'].transform(lambda x: x.expanding().mean().shift()).where(df['Val1'] > 1.6), 2)
My result:
          City  Year  Val1  Val2  Val3  Mean3big
0   Gothenburg  2018   1.5   2.3   107       NaN
1   Gothenburg  2018   1.3   3.3    10       NaN
2   Gothenburg  2018   2.2   2.3    20     10.00
3   Gothenburg  2018   1.5   2.1    30       NaN
4   Gothenburg  2018   2.5   2.3    20     20.00
5        Malmo  2018   1.6   2.3    10       NaN
6   Gothenburg  2018   1.9   2.8    10     20.00
7        Malmo  2018   0.7   4.3    30       NaN
8   Gothenburg  2018   1.7   3.2    40     18.00
9        Malmo  2018   1.0   3.3    40       NaN
10  Gothenburg  2018   3.7   2.3    10     21.67
11       Malmo  2018   1.0   2.9   112       NaN
12  Gothenburg  2018   2.7   2.3    20     20.00
13  Gothenburg  2019   1.3   3.3    10       NaN
14  Gothenburg  2019   1.2   2.3    20       NaN
15  Gothenburg  2019   1.6   2.1    10       NaN
16  Gothenburg  2019   1.8   2.3    10     13.33
17       Malmo  2019   1.6   1.3    20       NaN
18  Gothenburg  2019   1.9   2.8    30     12.50
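I want row 3 of Mean3big to give the last valid value for the city "Gothenburg", which is 10. Rows 0 and 1 are fine as NaN because there is no earlier valid value.
Row 7 should be 20, the last valid value for "Malmo". Row 5 is fine as NaN, and so on...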
Answer 0 (score: 0)
I did not take the last sentence of your post into account. Maybe try this:
import pandas as pd
df = pd.DataFrame(
    [
        ["Gothenburg", "2018", 1.5, 2.3, 107],
        ["Gothenburg", 2018, 1.3, 3.3, 10],
        ["Gothenburg", 2018, 2.2, 2.3, 20],
        ["Gothenburg", 2018, 1.5, 2.1, 30],
        ["Gothenburg", 2018, 2.5, 2.3, 20],
        ["Malmo", 2018, 1.6, 2.3, 10],
        ["Gothenburg", 2018, 1.9, 2.8, 10],
        ["Malmo", 2018, 0.7, 4.3, 30],
        ["Gothenburg", 2018, 1.7, 3.2, 40],
        ["Malmo", 2018, 1.0, 3.3, 40],
        ["Gothenburg", 2018, 3.7, 2.3, 10],
        ["Malmo", 2018, 1.0, 2.9, 112],
        ["Gothenburg", 2018, 2.7, 2.3, 20],
        ["Gothenburg", 2019, 1.3, 3.3, 10],
        ["Gothenburg", 2019, 1.2, 2.3, 20],
        ["Gothenburg", 2019, 1.6, 2.1, 10],
        ["Gothenburg", 2019, 1.8, 2.3, 10],
        ["Malmo", 2019, 1.6, 1.3, 20],
        ["Gothenburg", 2019, 1.9, 2.8, 30],
    ]
)
df.columns = ['City', 'Year', 'Val1', 'Val2', 'Val3']
df["Mean3big"] = round(
    df.groupby(['City', "Year"])['Val3']
    .transform(lambda x: x.expanding().mean().shift())
    .where(df['Val1'] > 1.6),
    2,
)
print(df)
valids = {}  # last valid Mean3big seen so far, keyed by city
for index, row in df.iterrows():
    # NaN is the only value that is not equal to itself, so this tests for NaN
    # (pd.isna(row['Mean3big']) or math.isnan() would work as well)
    if row['Mean3big'] != row['Mean3big']:
        if row['City'] in valids:
            df.at[index, 'Mean3big'] = valids[row['City']]
    else:
        valids[row['City']] = row['Mean3big']
print(df)
The loop makes a single pass over the rows, so the time complexity is O(n).
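As a possible vectorized alternative to the loop, pandas can forward-fill within groups. Below is a minimal sketch, assuming the fill should be per City only (as in the loop, so values carry over across years). The frame `sample` is a hypothetical cut-down of the first nine rows of the result in the question, and `plain_ffill` and `city_ffill` are illustrative column names, not part of the original post.

```python
import numpy as np
import pandas as pd

# First nine rows of the question's result, reduced to the two relevant columns.
sample = pd.DataFrame(
    {
        "City": ["Gothenburg", "Gothenburg", "Gothenburg", "Gothenburg", "Gothenburg",
                 "Malmo", "Gothenburg", "Malmo", "Gothenburg"],
        "Mean3big": [np.nan, np.nan, 10.0, np.nan, 20.0, np.nan, 20.0, np.nan, 18.0],
    }
)

# Plain forward fill ignores City, so Gothenburg values leak into Malmo rows.
sample["plain_ffill"] = sample["Mean3big"].ffill()

# Forward fill within each city: a NaN only takes an earlier value of the same city.
sample["city_ffill"] = sample.groupby("City")["Mean3big"].ffill()

print(sample)
```

With these values the Malmo rows stay NaN in `city_ffill`, because Malmo has no earlier valid value to carry forward, while `plain_ffill` fills them from Gothenburg. Applied to the full frame, the same call would be df.groupby('City')['Mean3big'].ffill().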