How do I convert NaN values to a categorical value based on a condition? I get an error when I try to convert the NaN values.
category gender sub-category title
health&beauty NaN makeup lipbalm
health&beauty women makeup lipstick
NaN NaN NaN lipgloss
My DataFrame looks like this. The function I wrote to convert the NaN values in gender to a categorical value looks like this:
def impute_gender(cols):
    category=cols[0]
    sub_category=cols[2]
    gender=cols[1]
    title=cols[3]
    if title.str.contains('Lip') and gender.isnull==True:
        return 'women'
df[['category','gender','sub_category','title']].apply(impute_gender,axis=1)
If I run the code, I get this error:
----> 7 if title.str.contains('Lip') and gender.isnull()==True:
8 print(gender)
9
AttributeError: ("'str' object has no attribute 'str'", 'occurred at index category')
Answer 0 (score: 13)
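A few things to note:
- Using apply over 4 columns is wasteful.
- apply is wasteful in general, because it is slow and gives you none of the benefits of vectorization.
- Inside apply you are dealing with scalars, so you cannot use the .str accessor the way you would on a pd.Series object. A plain substring test, "lip" in title, is enough.
- gender.isnull is simply wrong: gender is a scalar and has no isnull attribute. pd.isnull(gender) works on scalars.

Option 1: np.where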
import numpy as np
# True where gender is missing and the title contains 'lip'
m = df.gender.isnull() & df.title.str.contains('lip')
df['gender'] = np.where(m, 'women', df.gender)
df
category gender sub-category title
0 health&beauty women makeup lipbalm
1 health&beauty women makeup lipstick
2 NaN women NaN lipgloss
Not only is this fast, it is also simpler. If you are worried about case sensitivity, you can make the contains check case-insensitive:
import re
m = df.gender.isnull() & df.title.str.contains('lip', flags=re.IGNORECASE)
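str.contains also accepts case=False for case-insensitive matching, and na=False to treat a missing title as a non-match so the mask stays strictly boolean. A minimal sketch, assuming a small hypothetical frame (the 'Lip Liner', missing-title, and 'shampoo' rows are invented for illustration and are not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "gender": [np.nan, "women", np.nan, np.nan, np.nan],
    "title": ["lipbalm", "lipstick", "Lip Liner", np.nan, "shampoo"],
})

# case=False ignores case; na=False turns a missing title into False instead of NaN.
m = df["gender"].isnull() & df["title"].str.contains("lip", case=False, na=False)

# Fill only the rows selected by the mask; everything else keeps its value.
df["gender"] = np.where(m, "women", df["gender"])
print(df)
```

Rows with a missing or non-matching title keep their NaN gender, which is usually what you want when the rule only covers some products.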
Option 2: pd.Series.mask / pd.Series.where
Another alternative is to use pd.Series.mask or pd.Series.where:
df['gender'] = df.gender.mask(m, 'women')
Or:
df['gender'] = df.gender.where(~m, 'women')
df
category gender sub-category title
0 health&beauty women makeup lipbalm
1 health&beauty women makeup lipstick
2 NaN women NaN lipgloss
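mask implicitly applies the new value to the column wherever the supplied mask is True. where does the opposite and keeps the existing value wherever its condition is True, which is why it is called with ~m.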
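If a row-wise function is still wanted, the notes at the top of this answer translate into something like the sketch below. It is slower than the vectorized options and is shown only to illustrate the scalar handling. The frame is a hypothetical reconstruction of the question's sample, and the isinstance guard and the final return are my additions.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the sample data from the question.
df = pd.DataFrame({
    "category": ["health&beauty", "health&beauty", np.nan],
    "gender": [np.nan, "women", np.nan],
    "sub-category": ["makeup", "makeup", np.nan],
    "title": ["lipbalm", "lipstick", "lipgloss"],
})

def impute_gender(row):
    # With apply(axis=1) each field is a scalar, not a Series.
    gender, title = row["gender"], row["title"]
    # pd.isnull works on scalars; `in` is a plain substring test.
    if pd.isnull(gender) and isinstance(title, str) and "lip" in title.lower():
        return "women"
    # Return the existing value so non-matching rows are not turned into None.
    return gender

df["gender"] = df.apply(impute_gender, axis=1)
print(df)
```

Looking columns up by name rather than by position also avoids the cols[0] / cols[1] bookkeeping from the original function.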
Answer 1 (score: 6)
Or just use loc, as an Option 3 to @COLDSPEED's answer:
cond = (df['gender'].isnull()) & (df['title'].str.contains('lip'))
df.loc[cond, 'gender'] = 'women'
category gender sub-category title
0 health&beauty women makeup lipbalm
1 health&beauty women makeup lipstick
2 NaN women NaN lipgloss
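The assignment styles shown so far (np.where, mask, where, and loc) should all produce the same column. A quick check, assuming a hypothetical frame with one extra non-matching row ('shampoo' is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the 'shampoo' row is invented to include a non-matching title.
df = pd.DataFrame({
    "gender": [np.nan, "women", np.nan, np.nan],
    "title": ["lipbalm", "lipstick", "lipgloss", "shampoo"],
})
m = df["gender"].isnull() & df["title"].str.contains("lip")

# Each variant builds a new Series so the original column stays untouched.
via_np_where = pd.Series(np.where(m, "women", df["gender"]), index=df.index)
via_mask = df["gender"].mask(m, "women")
via_where = df["gender"].where(~m, "women")

via_loc = df["gender"].copy()
via_loc.loc[m] = "women"  # in-place assignment on the copy

for result in (via_np_where, via_mask, via_where, via_loc):
    print(result.tolist())
```

The practical difference is that loc modifies the object in place, while np.where, mask, and where return a new result that has to be assigned back.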
Answer 2 (score: 3)
Since we are dealing with NaN values, fillna is another option :-)
df.gender=df.gender.fillna(df.title.str.contains('lip').replace(True,'women'))
df
Out[63]:
category gender sub-category title
0 health&beauty women makeup lipbalm
1 health&beauty women makeup lipstick
2 NaN women NaN lipgloss
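One caveat with this approach, as far as I can tell: replace(True, 'women') leaves False in place for titles that do not match, so fillna would write False into those rows. It does not show up on the sample data because every title contains 'lip'. A sketch of a variant that uses Series.map with a dict so that non-matches become NaN instead (the 'shampoo' row is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the 'shampoo' row is invented to include a non-matching title.
df = pd.DataFrame({
    "gender": [np.nan, "women", np.nan, np.nan],
    "title": ["lipbalm", "lipstick", "lipgloss", "shampoo"],
})

# Keys missing from the dict (here False) map to NaN, so non-matching rows stay missing.
fill_values = df["title"].str.contains("lip").map({True: "women"})

# fillna aligns on the index and only touches rows where gender is NaN.
df["gender"] = df["gender"].fillna(fill_values)
print(df)
```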