Pandas:检查Dataframe的每个单元格中是否存在值

时间:2018-02-19 13:40:57

标签: python pandas dataframe apply

我使用python 2.7并希望根据每个单元格中每个列表值的存在来创建一列。

这是一个数据示例:

|    query      |
-----------------
| handbag woman |
| shoe man      |
| t-shirt baby  |
| watch unisex  |
| dress         |

我有一个我要检查的值列表:

gender_list=['woman', 'man', 'baby', 'unisex']

我期望的结果:

|   query      |   gender
-----------------------
|   handbag    |  woman 
|   shoe       |  man      
|   t-shirt    |  baby  
|   watch      |  unisex  
|   dress      |  None

这是我已经尝试过的事情:

for gender in gender_list:
    df['gender']=df['query'].map(lambda x : gender if (x.find(gender) != -1) else None)
    df['query']=df['query'].map(lambda x : x.replace(gender, '').strip() if (x.find(gender) != -1) else x)

2 个答案:

答案 0 :(得分:1)

首先在熊猫中最好不使用循环,因为缓慢(应用是引擎盖下的循环)而是使用矢量化解决方案。

使用extractreplace按正则表达式|加入所有值,并使用word boundary进行完全匹配:

gender_list=['woman', 'man', 'baby', 'unisex']
#exact match is not important
#pat = '|'.join(gender_list)
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
print (pat)
\bwoman\b|\bman\b|\bbaby\b|\bunisex\b

df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '').str.strip()
print (df)
     query  gender
0  handbag   woman
1     shoe     man
2  t-shirt    baby
3    watch  unisex
4    dress     NaN

的差异:

print (df)
           query
0  handbag woman
1      shoe many <-man change to many
2   t-shirt baby
3   watch unisex
4          dress

gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '').str.strip()
print (df)
       query  gender
0    handbag   woman
1  shoe many     NaN <-many not extracted
2    t-shirt    baby
3      watch  unisex
4      dress     NaN
gender_list=['woman', 'man', 'baby', 'unisex']
pat = '|'.join(gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '').str.strip()
print (df)
     query  gender
0  handbag   woman
1   shoe y     man <-stay y from many
2  t-shirt    baby
3    watch  unisex
4    dress     NaN

<强>计时

df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})
print (df)

df = pd.concat([df] * 10000, ignore_index=True)


In [299]: %%timeit
     ...: pat = '|'.join(r"\b{}\b".format(x) for x in gender_list)
     ...: df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
     ...: df['query'] = df['query'].str.replace(pat, '').str.strip()
     ...: 
     ...: 
1 loop, best of 3: 143 ms per loop

In [300]: %%timeit
     ...: gender_set = set(gender_list)
     ...: 
     ...: def gender_sep(row):
     ...:     lst = row['query'].split(' ')
     ...:     gender = next(iter(gender_set & set(lst)), None)
     ...:     return (' '.join(lst), None) if not gender else \
     ...:            (' '.join(i for i in lst if i!= gender), gender)
     ...: 
     ...: df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))
     ...: 
1 loop, best of 3: 933 ms per loop

编辑:

对于更常见的一般解决方案,需要re.escape转义正则表达式值:

import re

gender_list=['woman', 'man', 'baby', 'girl & boy']
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in gender_list)
df['gender'] = df['query'].str.extract('('+ pat + ')', expand=False)
df['query'] = df['query'].str.replace(pat, '').str.strip()

答案 1 :(得分:1)

这是一种方式。它不是最有效的,但它易读且易于维护。

import pandas as pd

df = pd.DataFrame({'query': ['handbag woman', 'shoe man', 't-shirt baby', 'watch unisex', 'dress', 'manpower']})

gender_list = ['woman', 'man', 'baby', 'unisex']
gender_set = set(gender_list)

def gender_sep(row):
    lst = row['query'].split(' ')
    gender = next(iter(gender_set & set(lst)), None)
    return (' '.join(lst), None) if not gender else \
           (' '.join(i for i in lst if i!= gender), gender)

df['query'], df['gender'] = list(zip(*df.apply(gender_sep, axis=1)))

#       query  gender
# 0   handbag   woman
# 1      shoe     man
# 2   t-shirt    baby
# 3     watch  unisex
# 4     dress    None
# 5  manpower    None