Background
I have a dataset that looks like this:
product_name price
Women's pant 20.00
Men's Shirt 30.00
Women's Dress 40.00
Blue Shirt 30.00
...
I want to create a new column, gender, that holds the value women, men, or unisex depending on the string in product_name.
The desired result is:
product_name price gender
Women's pant 20.00 women
Men's Shirt 30.00 men
Women's Dress 40.00 women
Blue Shirt 30.00 unisex
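My approach
I figured I should first create a new column with a blank value in every row. Then I should loop over each row in the dataframe, check the string in df['product_name'] to see whether the product is for men, for women, or unisex, and fill in the gender value for that row.
Here is my code: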
df['gender'] = ""
for product_name in df['product_name']:
    if 'women' in product_name.lower():
        df['gender'] = 'women'
    elif 'men' in product_name.lower():
        df['gender'] = 'men'
    else:
        df['gender'] = 'unisex'
However, I get the following result:
product_name price gender
Women's pant 20.00 men
Men's Shirt 30.00 men
Women's Dress 40.00 men
Blue Shirt 30.00 men
I would really appreciate any help here, as I am new to Python and the pandas library.
Answer 0 (score: 4)
You can use a list comprehension with if/else to get the output:
df['gender'] = ['women' if 'women' in word
else "men" if "men" in word
else "unisex"
for word in df.product_name.str.lower()]
df
product_name price gender
0 Women's pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue Shirt 30.0 unisex
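Alternatively, you can use numpy select to get the same result:
import numpy as np

# 'women' must come first: np.select takes the first condition that is True,
# and 'men' is a substring of 'women'.
cond1 = df.product_name.str.lower().str.contains("women")
cond2 = df.product_name.str.lower().str.contains("men")
condlist = [cond1, cond2]
choicelist = ["women", "men"]
df["gender"] = np.select(condlist, choicelist, default="unisex")
In general, for strings, plain Python iteration is often faster than pandas' vectorized string methods, but you should measure this on your own data.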
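One way to run that test is with the standard-library timeit module. The sketch below is only illustrative: the sample frame is the four rows from the question repeated (an assumption, not real data), and the helper names with_listcomp and with_select are made up for this example. Timings depend on the machine and the pandas version, so no numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample: the question's four rows repeated to get a larger frame.
df = pd.DataFrame({
    "product_name": ["Women's pant", "Men's Shirt", "Women's Dress", "Blue Shirt"] * 2500,
    "price": [20.0, 30.0, 40.0, 30.0] * 2500,
})


def with_listcomp(frame):
    """Label each product with a plain Python list comprehension."""
    return ["women" if "women" in word
            else "men" if "men" in word
            else "unisex"
            for word in frame["product_name"].str.lower()]


def with_select(frame):
    """Label each product with vectorized str.contains and np.select."""
    lowered = frame["product_name"].str.lower()
    condlist = [lowered.str.contains("women"), lowered.str.contains("men")]
    return np.select(condlist, ["women", "men"], default="unisex")


# Both approaches should agree before their speed is compared.
assert list(with_listcomp(df)) == list(with_select(df))

for func in (with_listcomp, with_select):
    seconds = timeit.timeit(lambda: func(df), number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```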
Answer 1 (score: 2)
Try converting your for statement into a function and using apply. Something like:
def label_gender(product_name):
    '''product_name is a str'''
    if 'women' in product_name.lower():
        return 'women'
    elif 'men' in product_name.lower():
        return 'men'
    else:
        return 'unisex'

df['gender'] = df.apply(lambda x: label_gender(x['product_name']), axis=1)
Details on using apply / lambda can be found here: https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7
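For context on why the loop in the question misbehaves: a statement such as df['gender'] = 'women' assigns one scalar to the entire column, so every pass overwrites all rows and the column ends up holding the label computed for the last row processed. Below is a minimal sketch, assuming a small frame built from the four sample rows in the question. It also uses Series.apply, which passes each product name straight to the function without a lambda.

```python
import pandas as pd

# Sample frame assumed from the four rows shown in the question.
df = pd.DataFrame({
    "product_name": ["Women's pant", "Men's Shirt", "Women's Dress", "Blue Shirt"],
    "price": [20.0, 30.0, 40.0, 30.0],
})


def label_gender(product_name):
    """Return 'women', 'men', or 'unisex' for one product name."""
    name = product_name.lower()
    if "women" in name:  # check 'women' first, because 'men' is a substring of it
        return "women"
    if "men" in name:
        return "men"
    return "unisex"


# The pattern from the question: each pass assigns one scalar to the WHOLE column.
buggy = df.copy()
for product_name in buggy["product_name"]:
    buggy["gender"] = label_gender(product_name)
# Every row now carries the label of the last product processed.
print(buggy["gender"].nunique())  # 1

# Series.apply calls label_gender once per value and keeps one result per row.
df["gender"] = df["product_name"].apply(label_gender)
print(df["gender"].tolist())  # ['women', 'men', 'women', 'unisex']
```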
Answer 2 (score: 2)
You can also use np.where + Series.str.contains:
import numpy as np
df['gender'] = (
np.where(df.product_name.str.contains("women", case=False), 'women',
np.where(df.product_name.str.contains("men", case=False), "men", 'unisex'))
)
product_name price gender
0 Women's pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue Shirt 30.0 unisex
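Note that both checks are plain substring matches, which is why "women" has to be tested before "men". A substring match would also label a name such as "Garment bag" as men. If that matters for your data, one option is a word-boundary regex. The sketch below assumes the four sample rows from the question plus one made-up row ("Garment bag", with an invented price) added purely for illustration.

```python
import numpy as np
import pandas as pd

# Four rows from the question plus one hypothetical row to show the pitfall.
df = pd.DataFrame({
    "product_name": ["Women's pant", "Men's Shirt", "Women's Dress",
                     "Blue Shirt", "Garment bag"],
    "price": [20.0, 30.0, 40.0, 30.0, 15.0],
})

names = df["product_name"]
# \b is a word boundary, so "men" no longer matches inside "Women" or "Garment".
is_women = names.str.contains(r"\bwomen\b", case=False)
is_men = names.str.contains(r"\bmen\b", case=False)

df["gender"] = np.where(is_women, "women", np.where(is_men, "men", "unisex"))
print(df["gender"].tolist())
# ['women', 'men', 'women', 'unisex', 'unisex']
```

A name written without the apostrophe, such as "Womens", would not match these patterns, so adjust the regex to the spellings that actually occur in your data.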
Answer 3 (score: 1)
Use np.where, .str.contains, and a regex that captures the first word in the phrase, like so:
# np.where(product_name contains Women or Men, first word of the phrase, otherwise 'Unisex')
df['gender'] = np.where(df.product_name.str.contains('Women|Men'),
                        df.product_name.str.split(r'(^\w+)').str[1], 'Unisex')
product_name price gender
0 Women's pant 20.0 Women
1 Men's Shirt 30.0 Men
2 Women's Dress 40.0 Women
3 Blue Shirt 30.0 Unisex
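The same idea can also be written with Series.str.extract, which returns the captured group directly and leaves NaN where nothing matches. This is a sketch of an alternative rather than the code above. The sample frame is assumed from the question, and .str.lower() is added only to produce the lowercase labels the question asks for.

```python
import pandas as pd

# Sample frame assumed from the four rows shown in the question.
df = pd.DataFrame({
    "product_name": ["Women's pant", "Men's Shirt", "Women's Dress", "Blue Shirt"],
    "price": [20.0, 30.0, 40.0, 30.0],
})

# Capture a leading "Women" or "Men" in any case; rows without a match become NaN.
first_word = df["product_name"].str.extract(r"(?i)^(women|men)\b", expand=False)

# Lowercase the matches and fill the unmatched rows with the default label.
df["gender"] = first_word.str.lower().fillna("unisex")
print(df["gender"].tolist())  # ['women', 'men', 'women', 'unisex']
```

Like the split-based version, this only looks at the first word, so a name such as "Shirt for Men" would be labelled unisex.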