Matching multiple phrases in Python pandas

Time: 2015-08-31 12:14:26

Tags: python regex pandas

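This is a follow-up to my previous question, Singular and plural phrase matching in pandas. Since the help provided there did not achieve the intended functionality, I am posting the approach I followed together with what I actually need to achieve.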

Below are the two phrase datasets and the code.

import pandas as pd

ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond"])

df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg"])

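I only need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:

  If an ingredient (singular or plural) is found in a phrase in the DataFrame,
      return the ingredient.
  Otherwise, return False.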
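The pseudocode above could be sketched in plain Python roughly as follows. The helper name `find_ingredient` is hypothetical (it is not from the original post), the sample frame is shortened, and the match is a simple case-insensitive substring test, which is an assumption that happens to cover plurals such as "eggs" for "egg".

```python
import pandas as pd

ingredients = pd.Series(["vanilla extract", "walnut", "oat", "egg", "almond"])
df = pd.DataFrame(["1 teaspoons vanilla extract", "2 eggs", "sdfgsfgsf"])

def find_ingredient(phrase, names):
    """Hypothetical helper: return the first name contained in phrase, else False."""
    lowered = phrase.lower()
    for name in names:
        if name.lower() in lowered:
            return name
    return False

names = ingredients.tolist()
# One result per DataFrame row, not one per ingredient.
df["existence"] = df[0].apply(lambda phrase: find_ingredient(phrase, names))
print(df)
```

A substring test is loose: "oat" would also match "coat" or "boat", so a word-boundary regular expression may be a better fit for real data.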

I wrote the following code based on the instructions from my other question.

results=ingredients.apply(lambda x: any(df[0].str.lower().str.contains(x.lower())))
df["existence"]=results
df

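The problem with my code is that it only checks as many rows as there are items in the Series and then stops. The result I actually need is as follows: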
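A small reproduction may make the symptom clearer. The attempt produces one boolean per ingredient (five values), and assigning that Series to the eight-row DataFrame aligns on index labels, so the last three rows receive NaN. The phrases below are shortened versions of the originals, used only for illustration.

```python
import pandas as pd

ingredients = pd.Series(["vanilla extract", "walnut", "oat", "egg", "almond"])
df = pd.DataFrame(["1 teaspoons vanilla extract", "2 eggs", "3 cups chopped walnuts",
                   "4 cups rolled oats", "1 can soup", "6 ounces almonds",
                   "sdfgsfgsf", "fsfgsgsfgfg"])

# One boolean per ingredient: "does this ingredient occur anywhere in df?"
results = ingredients.apply(
    lambda x: any(df[0].str.lower().str.contains(x.lower())))
print(len(results), len(df))

# Assignment aligns on index labels 0..4; rows 5..7 have no label to match.
df["existence"] = results
print(df["existence"].isna().sum())
```

In other words, the lambda answers a question about each ingredient, whereas the desired output needs an answer for each DataFrame row.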

    0                                            existence
0   1 teaspoons vanilla extract                  vanilla
1   2 eggs                                       egg
2   3 cups chopped walnuts                       walnut
3   4 cups rolled oats                           oat
4   1 (10.75 ounce) can.....                     False
5   6 ounces smoke-flavored almonds.....         almond
6   sdfgsfgsf                                    False
7   fsfgsgsfgfg                                  False

Can anyone tell me how I should implement this? I have spent several days testing it, but with no luck. Thanks, everyone.

1 answer:

Answer 0 (score: 1)

Check out numpy string operations

In [131]:

df.columns = ['val']
V = df.val.str.lower().values.astype(str)
K = ingredients.values.astype(str)
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[...,np.newaxis]),
                                             K[...,np.newaxis], '').T))
print(df)
                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...                 
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf                 
7                                        fsfgsgsfgfg     

There are two steps:

In [138]:
#check whether each ingredient is found
np.char.count(V, K[...,np.newaxis])
Out[138]:
array([[1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0]])
In [139]:
#if it is found, grab its name
np.where(np.char.count(V, K[...,np.newaxis]),
                      K[...,np.newaxis], '').T
Out[139]:
array([['vanilla extract', '', '', '', ''],
       ['', '', '', 'egg', ''],
       ['', 'walnut', '', '', ''],
       ['', '', 'oat', '', ''],
       ['', '', '', '', ''],
       ['', '', '', '', 'almond'],
       ['', '', '', '', ''],
       ['', '', '', '', '']], 
      dtype='|S15')
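For readers on Python 3, the two steps can be combined into a sketch like the one below. It is an adaptation rather than the answer's exact code: the arrays are built from plain lists, the sample data is shortened, and the intermediate names `counts` and `names` are introduced here for illustration. On Python 3 the string arrays are unicode (`<U`) rather than the `|S15` bytes dtype shown above.

```python
import numpy as np
import pandas as pd

ingredients = pd.Series(["vanilla extract", "walnut", "oat", "egg", "almond"])
df = pd.DataFrame({"val": ["1 teaspoons vanilla extract", "2 eggs",
                           "3 cups chopped walnuts", "sdfgsfgsf"]})

# Lowercased phrases, shape (4,), and ingredients as a column, shape (5, 1).
V = np.array(df["val"].str.lower().tolist(), dtype=str)
K = np.array(ingredients.tolist(), dtype=str)[:, np.newaxis]

# Step 1: broadcasting gives one count per (ingredient, phrase) pair.
counts = np.char.count(V, K)

# Step 2: keep the ingredient name where it occurs, then transpose so
# that each row corresponds to one phrase.
names = np.where(counts > 0, K, "").T

df["existence"] = ["".join(row) for row in names]
print(df)
```

Note that if a phrase contains more than one ingredient, `"".join` concatenates the names without a separator, so joining with a delimiter may be preferable in that case. Unmatched rows get an empty string rather than `False`.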