Matching singular and plural words with pandas

Time: 2015-09-13 06:17:02

Tags: python regex pandas

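This question is an extension of my previous question, Multiple Phrases Matching Python Pandas. Although I worked out an approach from the answers there, some typical problems with singular and plural words came up.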

import numpy as np
import pandas as pd

ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])

df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

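I only need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:

If an ingredient (singular or plural) is found in a phrase in the DataFrame, return the ingredient. Otherwise, return false.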

This was achieved with the following answer:

df.columns = ['val']
V = df.val.str.lower().values.astype(str)
K = ingredients.values.astype(str)
# list(...) is needed on Python 3, where map returns a lazy iterator
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[...,np.newaxis]),K[...,np.newaxis], '').T))
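For anyone unfamiliar with the broadcasting in that last line, here is a smaller sketch of the same idea, split into steps. The shortened `V` and `K` arrays and the intermediate names `counts` and `hits` are only for illustration.

```python
import numpy as np

# Four phrases and three ingredients, so the two axes are easy to tell apart
V = np.array(["2 eggs", "3 cups chopped walnuts", "sdfgsfgsf", "2 small strawberries"])
K = np.array(["egg", "walnut", "strawberry"])

# K[:, np.newaxis] has shape (3, 1); broadcasting it against V (shape (4,))
# gives counts[i, j] = number of times K[i] occurs in V[j]
counts = np.char.count(V, K[:, np.newaxis])

# Keep the ingredient name where the count is non-zero, an empty string elsewhere
hits = np.where(counts, K[:, np.newaxis], "")

# Transpose so each row belongs to one phrase, then join that row into one string
existence = ["".join(row) for row in hits.T]
print(counts.shape)
print(existence)
```

Since `np.char.count` is a plain substring count, "eggs" matches "egg", but nothing in "strawberries" matches "strawberry".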

I also applied the following to fill the empty cells with NaN, so that I can easily filter the data.

df.loc[df.existence=='', 'existence'] = np.nan

The result is as follows:

print(df)
                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...              NaN
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries              NaN

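This has worked correctly so far, but only when the plural is formed by simply appending a letter, as in almond => almonds or apple => apples. When something like strawberry => strawberries shows up, this code reports it as NaN.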
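The failure comes down to a plain substring test, which can be reproduced without pandas. A minimal sketch (the variable names are only for illustration):

```python
phrase_regular = "6 ounces smoke-flavored almonds, finely chopped"
phrase_irregular = "2 small strawberries"

# "almonds" contains "almond", so the singular is found inside the plural
regular_hit = "almond" in phrase_regular

# "strawberries" does not contain "strawberry": the two forms diverge after "strawberr"
irregular_hit = "strawberry" in phrase_irregular

print(regular_hit, irregular_hit)
```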

I want to improve my code so that it detects such cases. I would like to change my ingredients Series into a DataFrame like the one below.

#ingredients

#inputwords       #outputword

vanilla extract    vanilla extract 
walnut             walnut
walnuts            walnut
oat                oat
oats               oat
egg                egg
eggs               egg
almond             almond
almonds            almond
strawberry         strawberry
strawberries       strawberry
cherry             cherry
cherries           cherry

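So my logic is: if a word from #inputwords appears in the phrase, I want to return the corresponding #outputword in another cell. In other words, when strawberry or strawberries appears in the phrase, the code puts the word strawberry next to it. So my final result would be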

                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...              NaN
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries       strawberry

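I can't find a way to incorporate this functionality into my existing code, or to write new code that does it. Can anyone help me with this?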

2 answers:

Answer 0: (score: 1)

Consider using a stemmer :) http://www.nltk.org/howto/stem.html

Taken directly from their page:

    >>> from nltk.stem.snowball import SnowballStemmer
    >>> stemmer = SnowballStemmer("english")
    >>> stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
    >>> print(stemmer.stem("having"))
    have
    >>> print(stemmer2.stem("having"))
    having

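Refactor your code to stem all the words in each sentence, then match the stems against the ingredients list.

nltk is a really great tool, and it does exactly what you're asking for!

Cheers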
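A rough sketch of what that refactoring could look like, assuming nltk is installed. The helper names `stem_tokens` and `find_ingredient` are made up for this example, and the tokenization (lowercase, alphabetic runs only) is a simplifying assumption.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_tokens(text):
    """Lowercase the text, split it into alphabetic tokens, and stem each token."""
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

def find_ingredient(phrase, ingredients):
    """Return the first ingredient whose stemmed tokens occur contiguously in the phrase, else None."""
    words = stem_tokens(phrase)
    for ingredient in ingredients:
        target = stem_tokens(ingredient)
        n = len(target)
        # Slide a window of n tokens over the phrase and compare stems
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            return ingredient
    return None

ingredients = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]
for phrase in ["1 teaspoons vanilla extract", "2 small strawberries", "sdfgsfgsf"]:
    print(phrase, "->", find_ingredient(phrase, ingredients))
```

Because both sides go through the same stemmer, "strawberry" and "strawberries" are compared by their shared stem, so no hand-written plural table is needed. The function can be applied to a DataFrame column with `df['val'].apply(find_ingredient, ingredients=ingredients)`.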

Answer 1: (score: 0)

import numpy as np
import pandas as pd

# your data frame
df = pd.DataFrame(data = ["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

# Here you create mapping
mapping = pd.Series(index = ['vanilla extract' , 'walnut','walnuts','oat','oats','egg','eggs','almond','almonds','strawberry','strawberries','cherry','cherries'] , 
          data = ['vanilla extract' , 'walnut','walnut','oat','oat','egg','egg','almond','almond','strawberry','strawberry','cherry','cherry'])
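# create a function that checks whether any key of the mapping occurs in a given phrase
def get_match(row):
    match = np.nan
    # Series.iterkv() no longer exists in current pandas; items() yields the same (key, value) pairs
    for key, value in mapping.items():
        if key in row[0]:
            match = value
    return match

# apply this function to each row and store the result in a new column
df['existence'] = df.apply(get_match, axis=1)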
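The same lookup can probably be done without a Python-level loop by building one regular expression from the mapping keys. This is only a sketch on a shortened data set; `keys`, `pattern` and `found` are illustrative names, and the word-boundary anchors are an addition that the loop above does not have.

```python
import re
import pandas as pd

df = pd.DataFrame({"val": ["2 eggs",
                           "6 ounces smoke-flavored almonds, finely chopped",
                           "sdfgsfgsf",
                           "2 small strawberries"]})
mapping = pd.Series({"egg": "egg", "eggs": "egg",
                     "almond": "almond", "almonds": "almond",
                     "strawberry": "strawberry", "strawberries": "strawberry"})

# Try longer keys first; \b restricts matches to whole words
keys = sorted(mapping.index, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"

# extract() returns the first matching key per row (NaN if none); map() turns it into the output word
found = df["val"].str.lower().str.extract(pattern, expand=False)
df["existence"] = found.map(mapping)
print(df)
```

With the word boundaries, a key such as "oat" would not fire inside an unrelated word like "coat", which the plain `key in phrase` test would accept.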