Python pandas: how to replace NA values in a DataFrame with values looked up in another DataFrame?

Asked: 2020-07-17 18:23:09

Tags: python-3.x pandas replace

Suppose I have a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> rand = np.random.RandomState(42)
>>> data_points = 10
>>> dates = pd.date_range('2020-01-01', periods=data_points, freq='D')
>>> state_city = [('USA', 'Washington'), ('France', 'Paris'), ('Germany', 'Berlin')]
>>>
>>> df = pd.DataFrame()
>>> for _ in range(data_points):
...     state, city = state_city[rand.choice(len(state_city))]
...     df_row = pd.DataFrame(
...         {
...             'time' : rand.choice(dates),
...             'state': state,
...             'city': city,
...             'val1': rand.randint(0, data_points),
...             'val2': rand.randint(0, data_points)
...         }, index=[0]
...     )
...
...     df = pd.concat([df, df_row], ignore_index=True)
...
>>> df = df.sort_values(['time', 'state', 'city']).reset_index(drop=True)
>>> df.loc[rand.randint(0, data_points, size=rand.randint(1, 3)), ['state']] = pd.NA
>>> df.loc[rand.randint(0, data_points, size=rand.randint(1, 3)), ['city']] = pd.NA
>>> df.val1 = df.val1.where(df.val1 < 5, pd.NA)
>>> df.val2 = df.val2.where(df.val2 < 5, pd.NA)
>>>
>>> df
        time    state        city  val1  val2
0 2020-01-03      USA  Washington     4     2
1 2020-01-04   France        <NA>  <NA>     1
2 2020-01-04  Germany      Berlin  <NA>     4
3 2020-01-05  Germany      Berlin  <NA>  <NA>
4 2020-01-06   France       Paris     1     4
5 2020-01-06  Germany      Berlin     4     1
6 2020-01-08  Germany      Berlin     4     3
7 2020-01-10  Germany      Berlin     2  <NA>
8 2020-01-10     <NA>  Washington  <NA>  <NA>
9 2020-01-10     <NA>  Washington     2  <NA>
>>>

As you can see, it contains some NA values. I want to impute the state/city values wherever possible. To do that, I build a helper DataFrame:

>>> known_state_city = df[['state', 'city']].dropna().drop_duplicates()
>>> known_state_city
     state        city
0      USA  Washington
2  Germany      Berlin
4   France       Paris

OK, now we have all the known state/city combinations.

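How can I use the known_state_city DataFrame to fill in a missing state when the city is known? I can find the rows where state is empty but city is filled: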

>>> df.loc[df.state.isna() & df.city.notna(), 'city']
8    Washington
9    Washington
Name: city, dtype: object

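But how do I look up Washington in known_state_city and write the matching state into df.state without disturbing the index values (8 and 9)? And if known_state_city does not contain every combination, how do I update the states in df with whatever I do have?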

1 Answer:

Answer 0: (score: 1)

We can use map inside fillna, twice: once to fill state from city, and once to fill city from state.

# fill empty state
df['state'] = df['state'].fillna(df['city'].map(known_state_city.set_index('city')['state']))

# fill empty city
df['city'] = df['city'].fillna(df['state'].map(known_state_city.set_index('state')['city']))

Output:

         time    state        city  val1  val2
0  2020-01-03      USA  Washington   4.0   2.0
1  2020-01-04   France       Paris   NaN   1.0
2  2020-01-04  Germany      Berlin   NaN   4.0
3  2020-01-05  Germany      Berlin   NaN   NaN
4  2020-01-06   France       Paris   1.0   4.0
5  2020-01-06  Germany      Berlin   4.0   1.0
6  2020-01-08  Germany      Berlin   4.0   3.0
7  2020-01-10  Germany      Berlin   2.0   NaN
8  2020-01-10      USA  Washington   NaN   NaN
9  2020-01-10      USA  Washington   2.0   NaN
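
To address the second part of the question (a lookup that does not cover every combination), here is a minimal self-contained sketch of the same map/fillna idea. The toy frame and the names city_to_state and state_to_city are assumptions for illustration and are not part of the original post. It suggests that a city missing from the lookup (here 'Rome') simply leaves the state as NA, and that the original index is preserved because fillna aligns on it.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 4 lack a state, row 3 lacks a city,
# and 'Rome' never appears together with a known state.
df = pd.DataFrame(
    {
        'state': ['USA', 'France', pd.NA, 'France', pd.NA],
        'city': ['Washington', 'Paris', 'Washington', pd.NA, 'Rome'],
    }
)

# Lookup table built only from rows where both values are present.
known_state_city = df[['state', 'city']].dropna().drop_duplicates()

# Series used as lookups: index = key, values = what to fill in.
city_to_state = known_state_city.set_index('city')['state']
state_to_city = known_state_city.set_index('state')['city']

# Assumption: each city belongs to exactly one state and vice versa.
# A Series with duplicate index labels is not a usable lookup for map.
assert city_to_state.index.is_unique
assert state_to_city.index.is_unique

# map() yields NaN for keys absent from the lookup, and fillna() only
# touches rows that are currently missing, so known values are kept.
df['state'] = df['state'].fillna(df['city'].map(city_to_state))
df['city'] = df['city'].fillna(df['state'].map(state_to_city))

print(df)
```

In this sketch, rows 2 and 3 are filled from the lookup, while row 4 keeps a missing state because 'Rome' has no known match. If a city were paired with more than one state, the lookup index would contain duplicates and the ambiguity would need to be resolved (for example by dropping or choosing among the duplicates) before calling map.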