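This question is an extension of my earlier question, "Multiple Phrases Matching Python Pandas". I worked out an approach from the answer to that question, but some typical problems with singular and plural words have come up.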
import numpy as np
import pandas as pd

ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])
df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])
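I just need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:
If an ingredient (singular or plural) is found in a phrase in the DataFrame, return the ingredient. Otherwise, return false.
This was achieved by the following answer: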
df.columns = ['val']
V = df.val.str.lower().values.astype(str)
K = ingredients.values.astype(str)
# list() is needed on Python 3, where map() returns a lazy iterator
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[...,np.newaxis]),K[...,np.newaxis], '').T))
I also applied the following to fill the empty cells with NaN so that I can easily filter those rows out.
df.loc[df.existence=='', 'existence'] = np.nan
The result is as follows:
print(df)
val existence
0 1 teaspoons vanilla extract vanilla extract
1 2 eggs egg
2 3 cups chopped walnuts walnut
3 4 cups rolled oats oat
4 1 (10.75 ounce) can Campbell's Condensed Cream... NaN
5 6 ounces smoke-flavored almonds, finely chopped almond
6 sdfgsfgsf NaN
7 fsfgsgsfgfg NaN
8 2 small strawberries NaN
This works as long as the plural is just the singular plus a suffix, as in almond => almonds or apple => apples. But when something appears as strawberry => strawberries, this code reports it as NaN.
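The failure comes down to plain substring matching, which a quick check makes visible. This is a minimal sketch in plain Python; the two phrases are taken from the sample data above.

```python
# Substring matching only catches plurals formed by appending to the singular.
phrases = ["6 ounces smoke-flavored almonds, finely chopped", "2 small strawberries"]

almond_found = "almond" in phrases[0]          # "almonds" contains "almond"
strawberry_found = "strawberry" in phrases[1]  # "strawberries" does not contain "strawberry"

print(almond_found, strawberry_found)
```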
To improve my code so that it detects such cases, I would like to change my ingredients Series into a DataFrame like the one below.
#ingredients
#inputwords #outputword
vanilla extract vanilla extract
walnut walnut
walnuts walnut
oat oat
oats oat
egg egg
eggs egg
almond almond
almonds almond
strawberry strawberry
strawberries strawberry
cherry cherry
cherries cherry
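So my logic is: whenever one of the #inputwords appears in a phrase, I want the corresponding #outputword returned in another cell. In other words, when strawberry or strawberries appears in the phrase, the code puts the word strawberry next to it. So my final result would be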
val existence
0 1 teaspoons vanilla extract vanilla extract
1 2 eggs egg
2 3 cups chopped walnuts walnut
3 4 cups rolled oats oat
4 1 (10.75 ounce) can Campbell's Condensed Cream... NaN
5 6 ounces smoke-flavored almonds, finely chopped almond
6 sdfgsfgsf NaN
7 fsfgsgsfgfg NaN
8 2 small strawberries strawberry
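I cannot find a way to incorporate this functionality into the existing code or to write new code that does it. Can anyone help me with this?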
Answer 0 (score: 1)
Consider using a stemmer :) http://www.nltk.org/howto/stem.html
Taken straight from their page:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
>>> print(stemmer.stem("having"))
have
>>> print(stemmer2.stem("having"))
having
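Refactor your code to stem all the words in each sentence, then match the stems against the ingredient list.
nltk is an awesome tool for exactly what you are asking!
Cheers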
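One way the "stem, then match" idea could look is sketched below. To keep the example free of extra dependencies it uses `simple_stem`, a hypothetical stand-in that only strips common plural endings; it is not part of nltk. Assuming nltk is installed, `SnowballStemmer("english").stem` could be passed as the `stem` argument instead. The helper names `stem_phrase` and `find_ingredient` are also made up for illustration.

```python
import re

def simple_stem(word):
    """Hypothetical stand-in for a real stemmer: strips common plural endings."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def stem_phrase(text, stem=simple_stem):
    """Lowercase the text, split it into alphabetic words, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]

def find_ingredient(phrase, ingredients, stem=simple_stem):
    """Return the first ingredient whose stemmed words occur consecutively in the phrase, else None."""
    words = stem_phrase(phrase, stem)
    for ingredient in ingredients:
        target = stem_phrase(ingredient, stem)
        n = len(target)
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            return ingredient
    return None

ingredients = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]
phrases = ["1 teaspoons vanilla extract", "2 eggs", "2 small strawberries", "sdfgsfgsf"]
matches = [find_ingredient(p, ingredients) for p in phrases]
print(matches)
```

Because both the phrase and the ingredient go through the same stemmer, the stems do not need to be real words; they only need to agree with each other.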
Answer 1 (score: 0)
import numpy as np
import pandas as pd

# your data frame
df = pd.DataFrame(data = ["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])
# Here you create mapping
mapping = pd.Series(index = ['vanilla extract' , 'walnut','walnuts','oat','oats','egg','eggs','almond','almonds','strawberry','strawberries','cherry','cherries'] ,
data = ['vanilla extract' , 'walnut','walnut','oat','oat','egg','egg','almond','almond','strawberry','strawberry','cherry','cherry'])
# create a function that checks whether any key of the mapping occurs in a given phrase
def get_match(row):
    match = np.nan
    # Series.iterkv() was removed from pandas; items() yields the same (key, value) pairs
    for key, value in mapping.items():
        if key in row[0]:
            match = value
    return match

# apply this function on each row
df.apply(get_match, axis = 1)
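As a possible alternative to the row-wise loop, the same input-word to output-word lookup can be sketched with one regular expression and `Series.str.extract`. This is an illustration under stated assumptions, not the answer's own method: the mapping is a shortened copy of the table from the question, matching is case-insensitive via `str.lower()`, and the names `phrases`, `pattern`, `found` and `existence` are chosen here for the example.

```python
import re
import pandas as pd

phrases = pd.Series([
    "1 teaspoons vanilla extract",
    "2 eggs",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf",
    "2 small strawberries",
])

# Input word -> output word, as in the table from the question (shortened).
mapping = {
    "vanilla extract": "vanilla extract",
    "egg": "egg", "eggs": "egg",
    "almond": "almond", "almonds": "almond",
    "strawberry": "strawberry", "strawberries": "strawberry",
}

# Longest keys first so "eggs" is tried before "egg"; \b keeps matches on word boundaries.
keys = sorted(mapping, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"

# Extract the first matching input word per phrase, then translate it to its output word.
found = phrases.str.lower().str.extract(pattern, expand=False)
existence = found.map(mapping)  # rows without a match stay missing (NaN)
print(existence.tolist())
```

The word boundaries mean that "egg" alone no longer matches inside "eggs", so every plural form has to be listed as its own key, which is what the proposed input/output table already does.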