Suppose I have a DataFrame:
>>> import pandas as pd
>>> import numpy as np
>>> rand = np.random.RandomState(42)
>>> data_points = 10
>>> dates = pd.date_range('2020-01-01', periods=data_points, freq='D')
>>> state_city = [('USA', 'Washington'), ('France', 'Paris'), ('Germany', 'Berlin')]
>>>
>>> df = pd.DataFrame()
>>> for _ in range(data_points):
...     state, city = state_city[rand.choice(len(state_city))]
...     df_row = pd.DataFrame(
...         {
...             'time': rand.choice(dates),
...             'state': state,
...             'city': city,
...             'val1': rand.randint(0, data_points),
...             'val2': rand.randint(0, data_points)
...         }, index=[0]
...     )
...
...     df = pd.concat([df, df_row], ignore_index=True)
...
>>> df = df.sort_values(['time', 'state', 'city']).reset_index(drop=True)
>>> df.loc[rand.randint(0, data_points, size=rand.randint(1, 3)), ['state']] = pd.NA
>>> df.loc[rand.randint(0, data_points, size=rand.randint(1, 3)), ['city']] = pd.NA
>>> df.val1 = df.val1.where(df.val1 < 5, pd.NA)
>>> df.val2 = df.val2.where(df.val2 < 5, pd.NA)
>>>
>>> df
        time    state        city  val1  val2
0 2020-01-03      USA  Washington     4     2
1 2020-01-04   France        <NA>  <NA>     1
2 2020-01-04  Germany      Berlin  <NA>     4
3 2020-01-05  Germany      Berlin  <NA>  <NA>
4 2020-01-06   France       Paris     1     4
5 2020-01-06  Germany      Berlin     4     1
6 2020-01-08  Germany      Berlin     4     3
7 2020-01-10  Germany      Berlin     2  <NA>
8 2020-01-10     <NA>  Washington  <NA>  <NA>
9 2020-01-10     <NA>  Washington     2  <NA>
>>>
As you can see, some values are missing. I want to impute the state/city values wherever possible. To do that, I build a helper DataFrame:
>>> known_state_city = df[['state', 'city']].dropna().drop_duplicates()
>>> known_state_city
     state        city
0      USA  Washington
2  Germany      Berlin
4   France       Paris
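OK, now we have all the state/city combinations.
How can I use the known_state_city DataFrame to fill in a missing state when the city is known? I can find the rows where the state is missing but the city is present: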
>>> df.loc[df.state.isna() & df.city.notna(), 'city']
8    Washington
9    Washington
Name: city, dtype: object
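But how do I replace Washington with the matching state from known_state_city, without disturbing the index values (8 and 9), so that I can fill in the df.state values? And if known_state_city does not contain every combination, how do I update the states in df with what I do have?
Answer 0 (score: 1)
We can use fillna with map twice: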
# fill empty state
df['state'] = df['state'].fillna(df['city'].map(known_state_city.set_index('city')['state']))
# fill empty city
df['city'] = df['city'].fillna(df['state'].map(known_state_city.set_index('state')['city']))
Output:
        time    state        city  val1  val2
0 2020-01-03      USA  Washington   4.0   2.0
1 2020-01-04   France       Paris   NaN   1.0
2 2020-01-04  Germany      Berlin   NaN   4.0
3 2020-01-05  Germany      Berlin   NaN   NaN
4 2020-01-06   France       Paris   1.0   4.0
5 2020-01-06  Germany      Berlin   4.0   1.0
6 2020-01-08  Germany      Berlin   4.0   3.0
7 2020-01-10  Germany      Berlin   2.0   NaN
8 2020-01-10      USA  Washington   NaN   NaN
9 2020-01-10      USA  Washington   2.0   NaN
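The question also asks about index values and about a lookup table that does not cover every combination. The following is a minimal sketch of the same fillna-with-map approach on a small made-up frame. The data, the non-default index labels, the unmatched city "Madrid", and the helper names city_to_state and state_to_city are assumptions for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical data (not the question's frame). The index labels are
# deliberately non-default to show that they are left untouched.
df = pd.DataFrame(
    {
        "state": ["USA", None, "France", None, "Germany"],
        "city": ["Washington", "Washington", None, "Madrid", "Berlin"],
    },
    index=[10, 11, 12, 13, 14],
)

# Deliberately incomplete lookup table: there is no row for Madrid.
known_state_city = pd.DataFrame(
    {
        "state": ["USA", "France", "Germany"],
        "city": ["Washington", "Paris", "Berlin"],
    }
)

# Lookup Series: the index holds the key, the values hold the result.
city_to_state = known_state_city.set_index("city")["state"]
state_to_city = known_state_city.set_index("state")["city"]

# Mapping through a Series needs unique keys, so check before using it.
assert city_to_state.index.is_unique
assert state_to_city.index.is_unique

# map() returns a Series aligned to df's index, with NaN where the key is
# unknown; fillna() then fills only the cells that are currently missing.
df["state"] = df["state"].fillna(df["city"].map(city_to_state))
df["city"] = df["city"].fillna(df["state"].map(state_to_city))

print(df)
```

In this sketch, row 11 gets its state from the city and row 12 gets its city from the state. Row 13 keeps a missing state because Madrid is not in the lookup: map yields NaN for that key and fillna has nothing to fill with. The uniqueness check matters when one city name belongs to several states, in which case the lookup would have to be disambiguated first.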