Question

I have a large DataFrame in which I calculate a mean under a condition. I need to replace each NaN with the last valid value for that city.

I tried df['Mean3big'].fillna(method='ffill', inplace=True), but because that does not take the city into account, I get wrong values.
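To make the problem concrete, here is a minimal sketch on hypothetical toy data (the `toy` frame and the `global_ffill` column are made up for illustration, not taken from the real data). It shows how a forward fill over the whole column carries one city's value into another city's rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two interleaved cities, only Gothenburg has a valid value
toy = pd.DataFrame({
    "City": ["Gothenburg", "Malmo", "Gothenburg", "Malmo"],
    "Mean3big": [10.0, np.nan, np.nan, np.nan],
})

# A plain forward fill walks down the column and ignores "City",
# so the Malmo rows inherit Gothenburg's 10.0
toy["global_ffill"] = toy["Mean3big"].ffill()
print(toy)
```

Both Malmo rows end up with 10.0 even though Malmo never had a valid value of its own, which is the kind of wrong value described above.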

import pandas as pd

df = pd.DataFrame([["Gothenburg", "2018", 1.5, 2.3, 107],
["Gothenburg", 2018, 1.3, 3.3, 10],
["Gothenburg", 2018, 2.2, 2.3, 20],
["Gothenburg", 2018, 1.5, 2.1, 30],
["Gothenburg", 2018, 2.5, 2.3, 20],
["Malmo", 2018, 1.6, 2.3, 10],
["Gothenburg", 2018, 1.9, 2.8, 10],
["Malmo", 2018, 0.7, 4.3, 30],
["Gothenburg", 2018, 1.7, 3.2, 40],
["Malmo", 2018, 1.0, 3.3, 40],
["Gothenburg", 2018, 3.7, 2.3, 10],
["Malmo", 2018, 1.0, 2.9, 112],
["Gothenburg", 2018, 2.7, 2.3, 20],
["Gothenburg", 2019, 1.3, 3.3, 10],
["Gothenburg", 2019, 1.2, 2.3, 20],
["Gothenburg", 2019, 1.6, 2.1, 10],
["Gothenburg", 2019, 1.8, 2.3, 10],
["Malmo", 2019, 1.6, 1.3, 20],
["Gothenburg", 2019, 1.9, 2.8, 30]])

df.columns = ['City', 'Year', 'Val1', 'Val2', 'Val3']
df["Mean3big"] = round(df.groupby(['City', "Year"])['Val3'].transform(lambda x: x.expanding().mean().shift()).where(df['Val1'] > 1.6), 2)

My result:

          City  Year  Val1  Val2  Val3  Mean3big
0   Gothenburg  2018   1.5   2.3   107       NaN
1   Gothenburg  2018   1.3   3.3    10       NaN
2   Gothenburg  2018   2.2   2.3    20     10.00
3   Gothenburg  2018   1.5   2.1    30       NaN
4   Gothenburg  2018   2.5   2.3    20     20.00
5        Malmo  2018   1.6   2.3    10       NaN
6   Gothenburg  2018   1.9   2.8    10     20.00
7        Malmo  2018   0.7   4.3    30       NaN
8   Gothenburg  2018   1.7   3.2    40     18.00
9        Malmo  2018   1.0   3.3    40       NaN
10  Gothenburg  2018   3.7   2.3    10     21.67
11       Malmo  2018   1.0   2.9   112       NaN
12  Gothenburg  2018   2.7   2.3    20     20.00
13  Gothenburg  2019   1.3   3.3    10       NaN
14  Gothenburg  2019   1.2   2.3    20       NaN
15  Gothenburg  2019   1.6   2.1    10       NaN
16  Gothenburg  2019   1.8   2.3    10     13.33
17       Malmo  2019   1.6   1.3    20       NaN
18  Gothenburg  2019   1.9   2.8    30     12.50

I want row 3 of Mean3big to hold the last valid value for the city "Gothenburg", which is 10. Rows 0 and 1 are fine as NaN because there is no earlier valid value.

Row 7 should be 20, the last valid value for "Malmo". Row 5 is fine as NaN, and so on...

Answer 1

This does not take the last sentence of your post into account. Maybe try:

import pandas as pd

df = pd.DataFrame(
    [
        ["Gothenburg", "2018", 1.5, 2.3, 107],
        ["Gothenburg", 2018, 1.3, 3.3, 10],
        ["Gothenburg", 2018, 2.2, 2.3, 20],
        ["Gothenburg", 2018, 1.5, 2.1, 30],
        ["Gothenburg", 2018, 2.5, 2.3, 20],
        ["Malmo", 2018, 1.6, 2.3, 10],
        ["Gothenburg", 2018, 1.9, 2.8, 10],
        ["Malmo", 2018, 0.7, 4.3, 30],
        ["Gothenburg", 2018, 1.7, 3.2, 40],
        ["Malmo", 2018, 1.0, 3.3, 40],
        ["Gothenburg", 2018, 3.7, 2.3, 10],
        ["Malmo", 2018, 1.0, 2.9, 112],
        ["Gothenburg", 2018, 2.7, 2.3, 20],
        ["Gothenburg", 2019, 1.3, 3.3, 10],
        ["Gothenburg", 2019, 1.2, 2.3, 20],
        ["Gothenburg", 2019, 1.6, 2.1, 10],
        ["Gothenburg", 2019, 1.8, 2.3, 10],
        ["Malmo", 2019, 1.6, 1.3, 20],
        ["Gothenburg", 2019, 1.9, 2.8, 30],
    ]
)

df.columns = ['City', 'Year', 'Val1', 'Val2', 'Val3']
df["Mean3big"] = round(
    df.groupby(['City', "Year"])['Val3']
    .transform(lambda x: x.expanding().mean().shift())
    .where(df['Val1'] > 1.6),
    2,
)
print(df)

# Last valid Mean3big value seen so far, keyed by city
valids = {}
for index, row in df.iterrows():
    if pd.isna(row['Mean3big']):
        # NaN: fill with the city's last valid value, if one has been seen
        if row['City'] in valids:
            df.at[index, 'Mean3big'] = valids[row['City']]
    else:
        # Valid value: remember it for this city
        valids[row['City']] = row['Mean3big']

print(df)

The time complexity is O(n).
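As a possible vectorized alternative to the loop, pandas can forward fill within each group. The sketch below uses hypothetical toy data (the `toy` frame is an assumption for illustration) and assumes, like the loop above, that values should carry over per city regardless of year:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as in the question
toy = pd.DataFrame({
    "City": ["Gothenburg", "Malmo", "Gothenburg", "Malmo", "Malmo", "Gothenburg"],
    "Mean3big": [np.nan, np.nan, 10.0, 20.0, np.nan, np.nan],
})

# Forward fill inside each city only; leading NaNs stay NaN
# because there is no earlier valid value to copy
toy["Mean3big"] = toy.groupby("City")["Mean3big"].ffill()
print(toy)
```

On the real frame the same idea would be df['Mean3big'] = df.groupby('City')['Mean3big'].ffill(). If the fill should not cross year boundaries, grouping by ['City', 'Year'] instead would be the natural variation.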
