Merging two DataFrames gives NaN in the added columns

Time: 2021-01-16 20:39:33

Tags: python pandas merge

I have been struggling with the following problem for quite a while and would appreciate any help.

I want to merge df1 and df2 on the country column.

df1.head()
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+
|   |  loan_theme_id  | partner_id |               field_partner_name                |         loan_theme_type          |      location_name      |    lat    |    lon     | rural_pct |    city    |     region      |     country     |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+
| 0 | a1050000000wDrQ |        175 | Koret Israel Economic Development Funds (KIEDF) | Underserved                      | Abu Sanaan, Israel      | 32.958030 | 35.171969  | 0.0       | Abu Sanaan | Israel          | Israel          |
| 1 | a1050000007S5Kt |        485 | Building Markets                                | SME                              | Yangon, Myanmar (Burma) | 16.866069 | 96.195132  | NaN       | Yangon     | Myanmar (Burma) | Myanmar (Burma) |
| 2 | a1050000002YCWe |        369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV)       | Artisan                          | Chajul, Guatemala       | 15.483483 | -91.037070 | NaN       | Chajul     | Guatemala       | Guatemala       |
| 3 | a1050000007qJuI |         77 | Al Majmoua                                      | Vulnerable Populations (Syrian)2 | Aley, Lebanon           | 33.810086 | 35.597326  | 43.0      | Aley       | Lebanon         | Lebanon         |
| 4 | a1050000006FnC9 |        357 | Alivio Capital                                  | Imagen Dental                    | Matamoros,Tamps, Mexico | 25.869029 | -97.502738 | 3.0       | Matamoros  | Tamps           | Mexico          |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+

Here are the column types of df1:

Int64Index: 100 entries, 108 to 549
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   loan_theme_id       100 non-null    category
 1   partner_id          100 non-null    category
 2   field_partner_name  100 non-null    string  
 3   loan_theme_type     100 non-null    category
 4   location_name       100 non-null    string  
 5   lat                 100 non-null    float64 
 6   lon                 100 non-null    float64 
 7   rural_pct           79 non-null     float64 
 8   city                100 non-null    string  
 9   region              100 non-null    string  
 10  country             100 non-null    string  
dtypes: category(3), float64(3), string(5)
memory usage: 19.2 KB

df2.head()
+---+-------------+-------------------------+----------+
|   |   country   |      world_region       |   MPI    |
+---+-------------+-------------------------+----------+
| 0 | Afghanistan | South Asia              | 0.309853 |
| 1 | Albania     | Europe and Central Asia | NaN      |
| 2 | Algeria     | Arab States             | NaN      |
| 3 | Armenia     | Europe and Central Asia | NaN      |
| 4 | Azerbaijan  | Europe and Central Asia | NaN      |
+---+-------------+-------------------------+----------+

Column types of df2:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102 entries, 0 to 101
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country       102 non-null    string 
 1   world_region  102 non-null    object 
 2   MPI           78 non-null     float64
dtypes: float64(1), object(1), string(1)
memory usage: 3.2+ KB

Checking that there is at least some overlap:

display(df2[(df2.country == 'Guatemala')])
+----+-----------+-----------------------------+----------+
|    |  country  |        world_region         |   MPI    |
+----+-----------+-----------------------------+----------+
| 34 | Guatemala | Latin America and Caribbean | 0.113957 |
+----+-----------+-----------------------------+----------+

The merge:

df3 = pd.merge(df1, df2, on='country', how='left')
df3.head()
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+
|   |  loan_theme_id  | partner_id |               field_partner_name                |         loan_theme_type          |      location_name      |    lat    |    lon     | rural_pct |    city    |     region      |     country     | world_region | MPI |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+
| 0 | a1050000000wDrQ |        175 | Koret Israel Economic Development Funds (KIEDF) | Underserved                      | Abu Sanaan, Israel      | 32.958030 | 35.171969  | 0.0       | Abu Sanaan | Israel          | Israel          | NaN          | NaN |
| 1 | a1050000007S5Kt |        485 | Building Markets                                | SME                              | Yangon, Myanmar (Burma) | 16.866069 | 96.195132  | NaN       | Yangon     | Myanmar (Burma) | Myanmar (Burma) | NaN          | NaN |
| 2 | a1050000002YCWe |        369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV)       | Artisan                          | Chajul, Guatemala       | 15.483483 | -91.037070 | NaN       | Chajul     | Guatemala       | Guatemala       | NaN          | NaN |
| 3 | a1050000007qJuI |         77 | Al Majmoua                                      | Vulnerable Populations (Syrian)2 | Aley, Lebanon           | 33.810086 | 35.597326  | 43.0      | Aley       | Lebanon         | Lebanon         | NaN          | NaN |
| 4 | a1050000006FnC9 |        357 | Alivio Capital                                  | Imagen Dental                    | Matamoros,Tamps, Mexico | 25.869029 | -97.502738 | 3.0       | Matamoros  | Tamps           | Mexico          | NaN          | NaN |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+

Column types of df3:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   loan_theme_id       100 non-null    category
 1   partner_id          100 non-null    category
 2   field_partner_name  100 non-null    string  
 3   loan_theme_type     100 non-null    category
 4   location_name       100 non-null    string  
 5   lat                 100 non-null    float64 
 6   lon                 100 non-null    float64 
 7   rural_pct           79 non-null     float64 
 8   city                100 non-null    string  
 9   region              100 non-null    string  
 10  country             100 non-null    string  
 11  world_region        0 non-null      object  
 12  MPI                 0 non-null      float64 

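I really don't understand why world_region and MPI come out as NaN. I made sure that country contains no NaN in either df1 or df2, and that there is at least some overlap. The column types also match.

Edit: Thanks to Paul, I tried to retrieve the rows for, e.g., 'Guatemala' in df1. We can see in the table above that it does exist in df1. However, running display(df1[(df1.country == 'Guatemala')]) returns an empty dataframe. So I tried display(df1[(df1.country == ' Guatemala')]), with an extra space at the start, and now we get a result: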

+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+
|   |  loan_theme_id  | partner_id |            field_partner_name             | loan_theme_type |   location_name   |    lat    |    lon    | rural_pct |  city  |  region   |  country  |
+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+
| 2 | a1050000002YCWe |        369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV) | Artisan         | Chajul, Guatemala | 15.483483 | -91.03707 | NaN       | Chajul | Guatemala | Guatemala |
+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+

Is there a function in pandas that checks a df column for whitespace, which seems to be what is causing the problem?
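One way to check for this, as a minimal sketch: a value has leading or trailing whitespace exactly when stripping it changes it, so comparing the column with its stripped version flags the affected rows. The small frame below is a hypothetical stand-in for df1, not the real data.

```python
import pandas as pd

# Hypothetical miniature of df1: two of the three country values carry stray whitespace.
df1 = pd.DataFrame(
    {"country": pd.Series([" Guatemala", "Lebanon", "Mexico "], dtype="string")}
)

# True for every row whose value changes when surrounding whitespace is stripped.
has_padding = df1["country"] != df1["country"].str.strip()

# Number of affected rows.
print(int(has_padding.sum()))

# repr() puts quotes around each value, which makes the stray spaces visible.
padded = [repr(value) for value in df1.loc[has_padding, "country"]]
print(padded)
```

Running the same comparison on df2.country would show which of the two frames actually holds the padded values.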

1 answer:

Answer 0 (score: 0)

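You are performing a left join, as specified by how='left' in the merge call. This means that wherever the right dataframe has no row for a country in the left dataframe, the columns coming from the right are filled with NaN.
For details on join types and the left join, see the example here: https://www.w3schools.com/sql/sql_join_left.asp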
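To see which rows actually found a partner, the indicator parameter of pd.merge can help. The sketch below uses two hypothetical toy frames (the names left and right and their values are illustrative only); the left key has a leading space, similar to the situation in the question.

```python
import pandas as pd

# Hypothetical toy frames; " Guatemala" in the left frame has a leading space.
left = pd.DataFrame({"country": [" Guatemala", "Lebanon"], "city": ["Chajul", "Aley"]})
right = pd.DataFrame(
    {
        "country": ["Guatemala", "Lebanon"],
        # Illustrative values only.
        "world_region": ["Latin America and Caribbean", "Arab States"],
    }
)

# indicator=True adds a "_merge" column: "both" for rows whose key matched,
# "left_only" for rows of the left frame that had no match on the right.
merged = pd.merge(left, right, on="country", how="left", indicator=True)
print(merged[["country", "world_region", "_merge"]])
```

Rows marked left_only are the ones whose right-hand columns are NaN, so if every row is left_only, the keys do not match at all.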

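Edit:
This happens because the strings in one of the dataframes have extra whitespace around them. Before joining, you can remove it with .str.strip(), which is pandas' equivalent of a trim() function.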
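A minimal sketch of that fix, assuming the only mismatch is surrounding whitespace in the key column. The frames and their MPI values are hypothetical toy data, not the real df1 and df2.

```python
import pandas as pd

# Hypothetical toy frames; the left keys carry leading or trailing spaces.
left = pd.DataFrame({"country": [" Guatemala", "Lebanon "], "city": ["Chajul", "Aley"]})
right = pd.DataFrame({"country": ["Guatemala", "Lebanon"], "MPI": [0.11, 0.05]})

# Strip the key on both sides, since either frame could hold the padded values.
for frame in (left, right):
    frame["country"] = frame["country"].str.strip()

merged = pd.merge(left, right, on="country", how="left")

# Every left row now finds its match, so no MPI value is missing.
print(merged["MPI"].notna().all())
```

If some keys still fail to match after stripping, the remaining differences are probably in the names themselves (spelling or naming variants), which stripping cannot fix.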