Question

I'm in my first few weeks of learning pandas and need help with a problem I've run into. I have the two DataFrames listed below:

import pandas as pd

df1 = pd.DataFrame({
    'City': ['Chicago','Atlanta', 'Dallas', 'Atlanta', 'Chicago', 'Boston', 'Dallas','El Paso','Atlanta'],
    'State': ['IL','GA','TX','GA','IL','MA','TX','TX','GA'],
    'Population': [8865000,523738,6301000,523738,8865000,4309000,6301000,951000,523738]
}, columns=['City', 'State', 'Population'])

df1

    City    State   Population
0   Chicago IL     8865000
1   Atlanta GA     523738
2   Dallas  TX     6301000
3   Atlanta GA     523738
4   Chicago IL     8865000
5   Boston  MA     4309000
6   Dallas  TX     6301000
7   El Paso TX     951000
8   Atlanta GA     523738

df2 = pd.DataFrame({
    'Airport': ['Hartsfield','Logan','O Hare','DFW'],
    'M_Code': [78,26,52,39]
},index=[
    'Atlanta',
    'Boston',
    'Chicago',
    'Dallas'])


df2

          Airport        M_Code
Atlanta   Hartsfield     78
Boston    Logan          26
Chicago   O Hare         52
Dallas    DFW            39

The expected output is:

df1

    City    State   Population  M_Code  City_indexed_in_df2
0   Chicago IL      8865000     52      True
1   Atlanta GA      523738      78      True
2   Dallas  TX      6301000     39      True
3   Atlanta GA      523738      78      True
4   Chicago IL      8865000     52      True
5   Boston  MA      4309000     26      True
6   Dallas  TX      6301000     39      True
7   El Paso TX      951000      NaN     False
8   Atlanta GA      523738      78      True

I started with:

df1.loc[df1.City.isin(df2.index),:]

    City    State   Population
0   Chicago IL  8865000
1   Atlanta GA  523738
2   Dallas  TX  6301000
3   Atlanta GA  523738
4   Chicago IL  8865000
5   Boston  MA  4309000
6   Dallas  TX  6301000
8   Atlanta GA  523738

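As expected, this filter drops the El Paso row. But I can't come up with code that does the following: for each df1.City, look it up in df2.index and, if it is found:

extract df2.M_Code and insert its value into a new column df1.M_Code
insert the boolean result into a new column df1.City_indexed_in_df2

Can someone help me achieve this? Also, my thinking was that building a unique array from df1.City and then doing the lookup against df2.index might improve performance (as a newbie, I haven't worked out how to get any further than extracting the unique array below).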

arr = df1.City.unique()

array(['Chicago', 'Atlanta', 'Dallas', 'Boston', 'El Paso'], dtype=object)

Suggestions for a different approach to the solution would also be great.

Answer 1

You can use merge with how='left', then create the new column with notna():

df = df1.merge(df2, left_on=['City'], right_index=True, how='left')
df['City_indexed_in_df2'] = df['M_Code'].notna()
print(df)

      City State  Population     Airport  M_Code  City_indexed_in_df2
0  Chicago    IL     8865000      O Hare    52.0                 True
1  Atlanta    GA      523738  Hartsfield    78.0                 True
2   Dallas    TX     6301000         DFW    39.0                 True
3  Atlanta    GA      523738  Hartsfield    78.0                 True
4  Chicago    IL     8865000      O Hare    52.0                 True
5   Boston    MA     4309000       Logan    26.0                 True
6   Dallas    TX     6301000         DFW    39.0                 True
7  El Paso    TX      951000         NaN     NaN                False
8  Atlanta    GA      523738  Hartsfield    78.0                 True
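
Note that the merge above also brings in the Airport column and turns M_Code into floats because of the NaN. If only the two columns from the expected output are wanted, one alternative is to look each City up with Series.map and test membership with isin. The following is a minimal sketch, not the only way to do it. The sample frames are copied from the question, and the final cast to the nullable 'Int64' dtype is an optional extra that assumes integer codes are preferred over floats.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    'City': ['Chicago', 'Atlanta', 'Dallas', 'Atlanta', 'Chicago',
             'Boston', 'Dallas', 'El Paso', 'Atlanta'],
    'State': ['IL', 'GA', 'TX', 'GA', 'IL', 'MA', 'TX', 'TX', 'GA'],
    'Population': [8865000, 523738, 6301000, 523738, 8865000,
                   4309000, 6301000, 951000, 523738]
})
df2 = pd.DataFrame({
    'Airport': ['Hartsfield', 'Logan', 'O Hare', 'DFW'],
    'M_Code': [78, 26, 52, 39]
}, index=['Atlanta', 'Boston', 'Chicago', 'Dallas'])

# Passing a Series to map looks each City up in that Series' index;
# cities that are not found (El Paso) become NaN.
df1['M_Code'] = df1['City'].map(df2['M_Code'])

# A membership test against df2's index gives the boolean column directly.
df1['City_indexed_in_df2'] = df1['City'].isin(df2.index)

# Optional (assumption): nullable integers keep 52 rather than 52.0
# and show the missing code as <NA>.
df1['M_Code'] = df1['M_Code'].astype('Int64')

print(df1)
```

Both map and merge are vectorised lookups, so de-duplicating df1.City with unique() first is usually not needed. The map approach does rely on df2's index having no duplicate city names.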

Map column values in one DataFrame to the index of another DataFrame and extract values
