I am using Python 2.7 and want to create a new column based on whether any value from a list appears in each cell.
Here is a sample of the data:
| query |
-----------------
| handbag woman |
| shoe man |
| t-shirt baby |
| watch unisex |
| dress |
I have a list of values that I want to check for:
gender_list=['woman', 'man', 'baby', 'unisex']
My expected result:
| query | gender
-----------------------
| handbag | woman
| shoe | man
| t-shirt | baby
| watch | unisex
| dress | None
Here is what I have tried so far:
for gender in gender_list:
    df['gender']=df['query'].map(lambda x : gender if (x.find(gender) != -1) else None)
    df['query']=df['query'].map(lambda x : x.replace(gender, '').strip() if (x.find(gender) != -1) else x)
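Answer 0 (score: 1)
First of all, in pandas it is best to avoid loops because they are slow (apply is also a loop under the hood); use a vectorized solution instead.
Use str.extract and str.replace with a regular expression: join all values with | and wrap each one in word boundaries (\b) so that only whole words match: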
gender_list=['woman', 'man', 'baby', 'unisex']
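# if whole-word matching is not needed, a plain alternation is enough:
# pat = '|'.join(gender_list)
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
print (pat)
\bwoman\b|\bman\b|\bbaby\b|\bunisex\b
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
# regex=True (available since pandas 0.23) is required on pandas >= 2.0,
# where str.replace treats the pattern as a literal string by default
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()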
print (df)
query gender
0 handbag woman
1 shoe man
2 t-shirt baby
3 watch unisex
4 dress NaN
The difference between the two patterns, shown on slightly modified input:
print (df)
query
0 handbag woman
1 shoe many <- man changed to many
2 t-shirt baby
3 watch unisex
4 dress
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
print (df)
query gender
0 handbag woman
1 shoe many NaN <- many is not extracted
2 t-shirt baby
3 watch unisex
4 dress NaN
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
print (df)
query gender
0 handbag woman
1 shoe y man <- y left over from many
2 t-shirt baby
3 watch unisex
4 dress NaN
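The same contrast can be reproduced on plain strings with the standard re module, which may help when checking a pattern before applying it to a whole column. This is only a minimal sketch: the helper name split_keyword is made up for illustration and is not part of pandas or of the answer above, and unlike str.replace it removes only the first match.

```python
import re

gender_list = ['woman', 'man', 'baby', 'unisex']

# Pattern with word boundaries: matches only whole words.
bounded = re.compile('|'.join(r"\b{}\b".format(x) for x in gender_list))
# Plain alternation: also matches inside longer words such as "many".
plain = re.compile('|'.join(gender_list))


def split_keyword(text, pattern):
    """Hypothetical helper: return (text without the first match, matched keyword or None)."""
    match = pattern.search(text)
    if match is None:
        return text, None
    cleaned = (text[:match.start()] + text[match.end():]).strip()
    return cleaned, match.group(0)


print(split_keyword('shoe man', bounded))    # ('shoe', 'man')
print(split_keyword('shoe many', bounded))   # ('shoe many', None)
print(split_keyword('shoe many', plain))     # ('shoe y', 'man')
```

The last call shows the leftover "y" that appears when the word boundaries are dropped.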
Timings:
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
print (df)
df = pd.concat([df] * 10000, ignore_index=True)
In [299]: %%timeit
...: pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
...: df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
...: df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
...:
...:
1 loop, best of 3: 143 ms per loop
In [300]: %%timeit
...: gender_set = set(gender_list)
...:
...: def gender_sep(row):
...:     lst = row['query'].split(' ')
...:     gender = next(iter(gender_set & set(lst)), None)
...:     return (' '.join(lst), None) if not gender else \
...:            (' '.join(i for i in lst if i!= gender), gender)
...:
...: df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
...:
1 loop, best of 3: 933 ms per loop
EDIT:
For a more general solution, re.escape is needed to escape regex special characters in the values:
import re
gender_list=['woman', 'man', 'baby', 'girl & boy']
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '', regex=True).str.strip()
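To illustrate why escaping matters, here is a small sketch using only the standard re module. The value 'boys+girls' is a hypothetical list entry chosen because it contains the regex metacharacter "+"; it does not come from the question's data.

```python
import re

value = 'boys+girls'  # hypothetical list entry containing the regex metacharacter "+"

# Without escaping, "s+" means "one or more s", so the literal text is not matched.
raw = re.compile(r"\b{}\b".format(value))
# With re.escape the "+" is matched literally.
escaped = re.compile(r"\b{}\b".format(re.escape(value)))

print(raw.search('shirt boys+girls'))               # None
print(escaped.search('shirt boys+girls').group())   # boys+girls
print(raw.search('shirt boysgirls').group())        # boysgirls (an unintended match)
```

Note that \b assumes each value starts and ends with a letter, digit or underscore; a value ending in punctuation, such as 'kids (3+)', would likely need a different boundary check.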
Answer 1 (score: 1)
Here is one way. It is not the most efficient, but it is readable and easy to maintain.
import pandas as pd
df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
gender_list = ['woman', 'man', 'baby', 'unisex']
gender_set = set(gender_list)
def gender_sep(row):
    lst = row['query'].split(' ')
    gender = next(iter(gender_set & set(lst)), None)
    return (' '.join(lst), None) if not gender else \
           (' '.join(i for i in lst if i!= gender), gender)
df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
# query gender
# 0 handbag woman
# 1 shoe man
# 2 t-shirt baby
# 3 watch unisex
# 4 dress None
# 5 manpower None
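To see why 'manpower' is left alone here, the token logic inside gender_sep can be tried on plain strings, with no DataFrame involved. The function name split_tokens below is invented for this sketch; it mirrors the set intersection above but is not the answerer's code.

```python
gender_set = {'woman', 'man', 'baby', 'unisex'}


def split_tokens(query):
    """Hypothetical stand-alone version of gender_sep for a single string."""
    tokens = query.split(' ')
    # Whole tokens are compared, so 'manpower' never equals 'man'.
    # If several tokens matched, which one is picked would be arbitrary (set order).
    gender = next(iter(gender_set & set(tokens)), None)
    if gender is None:
        return query, None
    return ' '.join(t for t in tokens if t != gender), gender


print(split_tokens('handbag woman'))  # ('handbag', 'woman')
print(split_tokens('manpower'))       # ('manpower', None)
print(split_tokens('dress'))          # ('dress', None)
```

Because the comparison is per token, this approach gets whole-word matching without any regular expression, at the cost of a Python-level call per row.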