Question

I have a DataFrame:

   state   country
0  tx      us
1  ab      ca
2  fl      
3          
4  qc      ca
5  dawd

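I am trying to write a function that checks whether the country column has a value. If country is empty, it should check whether the value in state is a Canadian or US abbreviation. If it is, the correct country should be assigned to that row's country column.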

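For example, in the sample DataFrame above, the function would see that country is empty in row 2. It would then look at state and see that fl belongs to the US, so it would set country to us.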

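I think this can be done with pd.apply(), but I'm having trouble with the implementation.

I've been playing with the code below, but I'm doing something wrong...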

def country_identifier(country):
    states = ["AK", "AL", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", 
              "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", 
              "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
    provinces = ["ON", "BC", "AB", "MB", "NB", "QC", "NL", "NT", "NS", "PE", "YT", "NU", "SK"]
    if country["country"] not None:
        if country["state"] in states:
            return "us"
        elif country["state"] in provinces:
            return "ca"
    else:
        return country

df2 = df[["country", "state"]].apply(country_identifier)
df2

Answer 1

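You don't need nested np.where conditions, since nesting puts a hard limit on the conditions you can check. Use df.loc unless your list of conditions grows very large; it will be faster than apply: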

import pandas as pd
import numpy as np

states = ["AK", "AL", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", 
              "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", 
              "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
provinces = ["ON", "BC", "AB", "MB", "NB", "QC", "NL", "NT", "NS", "PE", "YT", "NU", "SK"]

df = pd.DataFrame({'country': {0: 'us', 1: 'ca', 2: np.nan, 3: np.nan, 4: 'ca', 5: np.nan},
                   'state': {0: 'tx', 1: 'ab', 2: 'fl', 3: np.nan, 4: 'qc', 5: 'dawd'}})

df.loc[(df['country'].isnull()) 
       & (df['state'].str.upper().isin(states)), 'country'] = 'us'

df.loc[(df['country'].isnull()) 
       & (df['state'].str.upper().isin(provinces)), 'country'] = 'ca'

This approach scales, because I can build the dictionary in any number of ways and then generalize the replacement:

conditions = {'ca': provinces, 'us': states}

for country, values in conditions.items():
    df.loc[(df['country'].isnull()) 
           & (df['state'].str.upper().isin(values)), 'country'] = country
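
For comparison, here is a minimal sketch of how the row-wise apply attempt from the question could be made to run. It is not part of the original answer. It assumes the function should receive a whole row (hence axis=1), that missing values are NaN, and that rows with an unrecognized state should stay missing. The names STATES and PROVINCES are just illustrative constants holding the same abbreviations as above.

```python
import numpy as np
import pandas as pd

# Same abbreviations as in the question, stored as sets for fast lookup.
STATES = set(
    "AK AL AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS "
    "MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV "
    "WI WY".split()
)
PROVINCES = set("ON BC AB MB NB QC NL NT NS PE YT NU SK".split())


def country_identifier(row):
    """Return the row's country, inferring it from state when it is missing."""
    country, state = row["country"], row["state"]
    if pd.notna(country):
        return country  # keep an existing value untouched
    if isinstance(state, str):  # skip NaN states
        code = state.upper()
        if code in STATES:
            return "us"
        if code in PROVINCES:
            return "ca"
    return np.nan  # nothing could be inferred


df = pd.DataFrame({
    "state": ["tx", "ab", "fl", np.nan, "qc", "dawd"],
    "country": ["us", "ca", np.nan, np.nan, "ca", np.nan],
})

# axis=1 passes each row to the function instead of each column.
df["country"] = df.apply(country_identifier, axis=1)
```

This calls a Python function once per row, so on large frames the df.loc version above is likely the better choice.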

Answer 2

You can use nested np.where:

df['country'] = np.where(df['state'].str.upper().isin(states), 'us', np.where(df['state'].str.upper().isin(provinces), 'ca', np.nan))

    state   country
0   tx      us
1   ab      ca
2   fl      us
3   None    nan
4   qc      ca

Edit: include the check on country first

cond1 = df.loc[df['country'].isnull(), 'state'].str.upper().isin(states)
cond2 = df.loc[df['country'].isnull(), 'state'].str.upper().isin(provinces)
df.loc[df['country'].isnull(), 'country'] = np.where(cond1, 'us', np.where(cond2, 'ca', np.nan))



    state   country
0   tx      us
1   ab      ca
2   fl      us
3   NaN     nan
4   qc      ca
5   dawd    nan

Another way, using numpy select; a one-liner that scales well to multiple conditions:

df.loc[df['country'].isnull(), 'country'] = np.select([cond1, cond2], ['us', 'ca'], np.nan)
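
The printed outputs above show a lowercase nan, which suggests the missing marker may end up stored as text when strings and np.nan are mixed in one NumPy call. As a hedged alternative that stays inside pandas, the sketch below builds a reverse lookup from abbreviation to country and fills only the missing entries. The names lookup and inferred are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

states = (
    "AK AL AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS "
    "MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV "
    "WI WY".split()
)
provinces = "ON BC AB MB NB QC NL NT NS PE YT NU SK".split()

# Reverse lookup: abbreviation -> country code.
lookup = {abbr: "us" for abbr in states}
lookup.update({abbr: "ca" for abbr in provinces})

df = pd.DataFrame({
    "state": ["tx", "ab", "fl", np.nan, "qc", "dawd"],
    "country": ["us", "ca", np.nan, np.nan, "ca", np.nan],
})

# Abbreviations not in the lookup (and NaN states) map to NaN.
inferred = df["state"].str.upper().map(lookup)

# fillna only touches rows where country is missing; existing values are kept.
df["country"] = df["country"].fillna(inferred)
```

Rows that cannot be resolved keep a real missing value, so a later df['country'].isnull() check still finds them.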

Pandas: assign a value based on the value in another column
