I have been struggling with the following problem for quite a while and would appreciate any help.
I want to merge df1 and df2 on "country".
df1.head()
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+
| | loan_theme_id | partner_id | field_partner_name | loan_theme_type | location_name | lat | lon | rural_pct | city | region | country |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+
| 0 | a1050000000wDrQ | 175 | Koret Israel Economic Development Funds (KIEDF) | Underserved | Abu Sanaan, Israel | 32.958030 | 35.171969 | 0.0 | Abu Sanaan | Israel | Israel |
| 1 | a1050000007S5Kt | 485 | Building Markets | SME | Yangon, Myanmar (Burma) | 16.866069 | 96.195132 | NaN | Yangon | Myanmar (Burma) | Myanmar (Burma) |
| 2 | a1050000002YCWe | 369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV) | Artisan | Chajul, Guatemala | 15.483483 | -91.037070 | NaN | Chajul | Guatemala | Guatemala |
| 3 | a1050000007qJuI | 77 | Al Majmoua | Vulnerable Populations (Syrian)2 | Aley, Lebanon | 33.810086 | 35.597326 | 43.0 | Aley | Lebanon | Lebanon |
| 4 | a1050000006FnC9 | 357 | Alivio Capital | Imagen Dental | Matamoros,Tamps, Mexico | 25.869029 | -97.502738 | 3.0 | Matamoros | Tamps | Mexico |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+
Here are the column types of df1:
Int64Index: 100 entries, 108 to 549
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_theme_id 100 non-null category
1 partner_id 100 non-null category
2 field_partner_name 100 non-null string
3 loan_theme_type 100 non-null category
4 location_name 100 non-null string
5 lat 100 non-null float64
6 lon 100 non-null float64
7 rural_pct 79 non-null float64
8 city 100 non-null string
9 region 100 non-null string
10 country 100 non-null string
dtypes: category(3), float64(3), string(5)
memory usage: 19.2 KB
df2.head()
+---+-------------+-------------------------+----------+
| | country | world_region | MPI |
+---+-------------+-------------------------+----------+
| 0 | Afghanistan | South Asia | 0.309853 |
| 1 | Albania | Europe and Central Asia | NaN |
| 2 | Algeria | Arab States | NaN |
| 3 | Armenia | Europe and Central Asia | NaN |
| 4 | Azerbaijan | Europe and Central Asia | NaN |
+---+-------------+-------------------------+----------+
Column types:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 102 entries, 0 to 101
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 102 non-null string
1 world_region 102 non-null object
2 MPI 78 non-null float64
dtypes: float64(1), object(1), string(1)
memory usage: 3.2+ KB
Making sure there is at least some overlap:
display(df2[(df2.country == 'Guatemala')])
+----+-----------+-----------------------------+----------+
| | country | world_region | MPI |
+----+-----------+-----------------------------+----------+
| 34 | Guatemala | Latin America and Caribbean | 0.113957 |
+----+-----------+-----------------------------+----------+
The merge:
df3 = pd.merge(df1, df2, on='country', how='left')
df3.head()
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+
| | loan_theme_id | partner_id | field_partner_name | loan_theme_type | location_name | lat | lon | rural_pct | city | region | country | world_region | MPI |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+
| 0 | a1050000000wDrQ | 175 | Koret Israel Economic Development Funds (KIEDF) | Underserved | Abu Sanaan, Israel | 32.958030 | 35.171969 | 0.0 | Abu Sanaan | Israel | Israel | NaN | NaN |
| 1 | a1050000007S5Kt | 485 | Building Markets | SME | Yangon, Myanmar (Burma) | 16.866069 | 96.195132 | NaN | Yangon | Myanmar (Burma) | Myanmar (Burma) | NaN | NaN |
| 2 | a1050000002YCWe | 369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV) | Artisan | Chajul, Guatemala | 15.483483 | -91.037070 | NaN | Chajul | Guatemala | Guatemala | NaN | NaN |
| 3 | a1050000007qJuI | 77 | Al Majmoua | Vulnerable Populations (Syrian)2 | Aley, Lebanon | 33.810086 | 35.597326 | 43.0 | Aley | Lebanon | Lebanon | NaN | NaN |
| 4 | a1050000006FnC9 | 357 | Alivio Capital | Imagen Dental | Matamoros,Tamps, Mexico | 25.869029 | -97.502738 | 3.0 | Matamoros | Tamps | Mexico | NaN | NaN |
+---+-----------------+------------+-------------------------------------------------+----------------------------------+-------------------------+-----------+------------+-----------+------------+-----------------+-----------------+--------------+-----+
Column types:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 loan_theme_id 100 non-null category
1 partner_id 100 non-null category
2 field_partner_name 100 non-null string
3 loan_theme_type 100 non-null category
4 location_name 100 non-null string
5 lat 100 non-null float64
6 lon 100 non-null float64
7 rural_pct 79 non-null float64
8 city 100 non-null string
9 region 100 non-null string
10 country 100 non-null string
11 world_region 0 non-null object
12 MPI 0 non-null float64
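I really don't understand why world_region and MPI come out as NaN. I made sure there are no NaN values in country in either df1 or df2, and that there is at least some overlap. The column types match as well.
Edit:
Thanks to Paul, I tried looking up, for example, 'Guatemala' in df1. We can see in the table above that it does exist in df1. However, running display(df1[(df1.country == 'Guatemala')])
returns an empty DataFrame. So I tried running display(df1[(df1.country == ' Guatemala')]), with an extra space at the start, and now we get a result: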
+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+
| | loan_theme_id | partner_id | field_partner_name | loan_theme_type | location_name | lat | lon | rural_pct | city | region | country |
+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+
| 2 | a1050000002YCWe | 369 | AsociaciÍ_n Chajulense de Mujeres (ACMUV) | Artisan | Chajul, Guatemala | 15.483483 | -91.03707 | NaN | Chajul | Guatemala | Guatemala |
+---+-----------------+------------+-------------------------------------------+-----------------+-------------------+-----------+-----------+-----------+--------+-----------+-----------+
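Is there a function in pandas that checks a DataFrame column for whitespace, and could the whitespace be what is causing the problem?
Answer 0 (score: 0)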
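You are performing a left join, as specified by how='left' in the merge call. This means that wherever the right dataframe has no row for a left row's country, you get NaN in the columns that come from the right dataframe.
For details on join types and left joins, see the example here: https://www.w3schools.com/sql/sql_join_left.asp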
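To see which left rows failed to find a partner, one option is the indicator parameter of merge. The snippet below is only a sketch on toy data: the frames `left` and `right` and their column values are made up for illustration (the two MPI values are copied from the df2 output in the question), and the leading space in " Guatemala" mimics the situation described there.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (hypothetical data, not the real frames).
left = pd.DataFrame({
    "loan_theme_type": ["Artisan", "SME"],
    "country": [" Guatemala", "Afghanistan"],  # note the leading space
})
right = pd.DataFrame({
    "country": ["Guatemala", "Afghanistan"],
    "MPI": [0.113957, 0.309853],
})

# indicator=True adds a "_merge" column that says where each row came from.
merged = left.merge(right, on="country", how="left", indicator=True)

# Rows marked "left_only" had no matching key in the right frame.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["country", "MPI"]])
```

Printing the unmatched keys with repr(), e.g. `[repr(c) for c in unmatched["country"]]`, makes stray spaces visible.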
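Edit:
This is because in one of the dataframes the strings have extra whitespace around them. Before joining, you can remove the whitespace with the .str.strip() string method, for example df1['country'] = df1['country'].str.strip().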
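As a rough sketch of how that could look in pandas: whitespace is removed with Series.str.strip(), and comparing a column against its stripped version shows which rows are affected. The frames below are toy stand-ins for df1 and df2 with made-up rows (only the MPI values are taken from the question), so adjust the names to your real data.

```python
import pandas as pd

# Toy stand-ins for the frames in the question (hypothetical rows).
df1 = pd.DataFrame({
    "loan_theme_type": ["Artisan", "SME"],
    "country": [" Guatemala", " Afghanistan"],  # padded keys
})
df2 = pd.DataFrame({
    "country": ["Guatemala", "Afghanistan"],
    "MPI": [0.113957, 0.309853],
})

# Detect padding: True where stripping changes the value.
has_padding = df1["country"] != df1["country"].str.strip()
print("rows with surrounding whitespace:", int(has_padding.sum()))

# Normalise the key on both sides, then merge.
df1["country"] = df1["country"].str.strip()
df2["country"] = df2["country"].str.strip()
df3 = df1.merge(df2, on="country", how="left")
print(df3)
```

Stripping both sides is a cheap precaution, since it is not always obvious which frame carries the padding.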