Question

I have a pandas DataFrame:

street_name        eircode
Malborough Road    BLT12
123 Fake Road      NaN
My Street          NaN

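I want to create another column called unique based on these conditions:

If the row has an eircode, return 'yes' in the unique column.
If it has no eircode, check the first string in street_name:
- If the first string is a digit, return 'yes' in the unique column
- If not, return 'no' in the unique column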

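I came up with the following solution, in which I:

changed the data types of both street_name and eircode to string
got the first string using a lambda function
defined a tagging function to apply to the DataFrame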

# change data types
df['eircode'] = df['eircode'].astype('str')
df['street_name'] = df['street_name'].astype('str')

# get the first string from street_name column
df['first_str'] = df['street_name'].apply(lambda x: x.split()[0])

def tagging(x):
    if x['eircode'] != 'nan':
        return 'yes'
    elif x['first_str'].isdigit() == True:
        return 'yes'
    else:
        return 'no'

df['unique'] = df.apply(tagging, axis=1)

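The problem with this is that I have to change the data types and then create a separate column. Is there a more elegant or more concise way to achieve the same result?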

Answer 1

With Pandas, it is best to use column-wise calculations; apply with a custom function amounts to an inefficient Python-level row-by-row loop.

import numpy as np
import pandas as pd

df = pd.DataFrame({'street_name': ['Malborough Road', '123 Fake Road', 'My Street'],
                   'eircode': ['BLT12', None, None]})

cond1 = df['eircode'].isnull()
cond2 = ~df['street_name'].str.split(n=1).str[0].str.isdigit()

df['unique'] = np.where(cond1 & cond2, 'no', 'yes')

print(df)

  eircode      street_name unique
0   BLT12  Malborough Road    yes
1    None    123 Fake Road    yes
2    None        My Street     no
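
As a quick sanity check, the sketch below compares the column-wise result with a row-wise apply version on the same three sample rows. The helper name tag_row and the intermediate variable names are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'street_name': ['Malborough Road', '123 Fake Road', 'My Street'],
                   'eircode': ['BLT12', None, None]})

# Column-wise version: 'no' only when eircode is missing AND the first
# whitespace-separated token of street_name is not all digits.
no_eircode = df['eircode'].isnull()
first_not_digit = ~df['street_name'].str.split(n=1).str[0].str.isdigit()
vectorized = np.where(no_eircode & first_not_digit, 'no', 'yes')

# Row-wise version (hypothetical helper), kept only for comparison.
def tag_row(row):
    if pd.notnull(row['eircode']):
        return 'yes'
    return 'yes' if row['street_name'].split()[0].isdigit() else 'no'

row_wise = df.apply(tag_row, axis=1)

# Both approaches should agree on every row.
print(vectorized.tolist() == row_wise.tolist())
```

Neither version needs the astype('str') conversion or the extra first_str column from the question.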

Answer 2

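You can combine these separate conditions with the | operator and then map the resulting boolean array to yes and no. The first condition checks whether eircode is not null, and the second uses a regular expression to check whether street_name starts with a digit: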

df['unique'] = ((~df.eircode.isnull()) | (df.street_name.str.match('^[0-9]'))).map({True:'yes',False:'no'})
>>> df
       street_name eircode unique
0  Malborough Road   BLT12    yes
1    123 Fake Road     NaN    yes
2        My Street     NaN     no
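
Note that the two answers test slightly different things: the regular expression only requires street_name to start with a digit, while the split-based check requires the whole first token to be digits. A minimal sketch of the difference follows; the address '12A Main Street' is a made-up sample that is not in the original data.

```python
import pandas as pd

# '12A Main Street' is a hypothetical row added to show where the checks differ.
streets = pd.Series(['123 Fake Road', '12A Main Street', 'My Street'])

# Answer 2 style: True when the string begins with a digit.
starts_with_digit = streets.str.match(r'^[0-9]')

# Answer 1 style: True when the entire first token consists of digits.
first_token_all_digits = streets.str.split(n=1).str[0].str.isdigit()

print(starts_with_digit.tolist())       # [True, True, False]
print(first_token_all_digits.tolist())  # [True, False, False]
```

Which behavior is wanted depends on whether house numbers such as 12A should count as a number.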

Tagging rows based on other column values
