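This is a follow-up to my earlier question, Singular and plural phrase matching in pandas. The help offered there did not produce the behaviour I need, so I am posting the approach I followed together with what I actually need to achieve.
Below are the two phrase datasets and the code.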
import pandas as pd
ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond"])
df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg"])
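I simply need to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:
If an ingredient (singular or plural) is found in a phrase in the DataFrame, return the ingredient. Otherwise, return False.
I wrote the following code based on the guidance given in my other question.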
results=ingredients.apply(lambda x: any(df[0].str.lower().str.contains(x.lower())))
df["existence"]=results
df
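The problem with my code is that it only checks as many rows as there are items in the Series and then stops looking. The result I actually need is the following: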
   0                                       existence
0  1 teaspoons vanilla extract             vanilla
1  2 eggs                                  egg
2  3 cups chopped walnuts                  walnut
3  4 cups rolled oats                      oat
4  1 (10.75 ounce) can.....                False
5  6 ounces smoke-flavored almonds.....    almond
6  sdfgsfgsf                               False
7  fsfgsgsfgfg                             False
Can anyone tell me how I should implement this? I have spent several days testing it without any luck. Thanks, everyone.
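The symptom described in the question can be reproduced with the data given there. The sketch below suggests why the result seems to stop after five rows: `ingredients.apply(...)` returns one boolean per ingredient (five values), and assigning that Series to an eight-row DataFrame aligns on the index, so rows 5 to 7 receive NaN. The variable names are the ones used in the question; nothing else is assumed.

```python
import pandas as pd

ingredients = pd.Series(["vanilla extract", "walnut", "oat", "egg", "almond"])
df = pd.DataFrame(["1 teaspoons vanilla extract", "2 eggs", "3 cups chopped walnuts",
                   "4 cups rolled oats",
                   "1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup",
                   "6 ounces smoke-flavored almonds, finely chopped",
                   "sdfgsfgsf", "fsfgsgsfgfg"])

# One boolean per ingredient: "does this ingredient occur in ANY phrase?"
results = ingredients.apply(lambda x: any(df[0].str.lower().str.contains(x.lower())))

# Assignment aligns on the index, so only rows 0..4 receive a value;
# rows 5..7 have no matching index label in `results` and become NaN.
df["existence"] = results
missing_rows = int(df["existence"].isna().sum())
```

In other words, the loop runs over ingredients rather than over phrases, which is the opposite of what the pseudocode asks for.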
Answer 0 (score: 1)
In [131]:
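import numpy as np
df.columns = ['val']
V = df.val.str.lower().values.astype(str)
K = ingredients.values.astype(str)
# Keep the ingredient name wherever it occurs in a phrase, then join each row into one string
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[..., np.newaxis]),
                                             K[..., np.newaxis], '').T))
print(df)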
   val                                                existence
0  1 teaspoons vanilla extract                        vanilla extract
1  2 eggs                                             egg
2  3 cups chopped walnuts                             walnut
3  4 cups rolled oats                                 oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...
5  6 ounces smoke-flavored almonds, finely chopped    almond
6  sdfgsfgsf
7  fsfgsgsfgfg
There are two steps:
In [138]:
# check whether each ingredient is found in each phrase
np.char.count(V, K[...,np.newaxis])
Out[138]:
array([[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0]])
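The shape of that count matrix comes from NumPy broadcasting: `K[..., np.newaxis]` turns the ingredients into a column, which is then compared against every phrase in `V`. A minimal sketch on a smaller, made-up pair of arrays (the shortened `V` and `K` here are illustrative, not the data from the question):

```python
import numpy as np

V = np.array(["2 eggs", "3 cups chopped walnuts", "sdfgsfgsf"])  # phrases, shape (3,)
K = np.array(["walnut", "egg"])                                  # ingredients, shape (2,)

K_col = K[..., np.newaxis]        # shape (2, 1): one row per ingredient
counts = np.char.count(V, K_col)  # broadcasts to shape (2, 3)

# counts[i, j] is how many times ingredient i occurs in phrase j
print(counts.shape)
print(counts)
```

Each row corresponds to one ingredient and each column to one phrase, which is why the result is transposed in the next step before being written back to the DataFrame.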
In [139]:
#if it is found, grab its name
np.where(np.char.count(V, K[...,np.newaxis]),
K[...,np.newaxis], '').T
Out[139]:
array([['vanilla extract', '', '', '', ''],
['', '', '', 'egg', ''],
['', 'walnut', '', '', ''],
['', '', 'oat', '', ''],
['', '', '', '', ''],
['', '', '', '', 'almond'],
['', '', '', '', ''],
['', '', '', '', '']],
dtype='|S15')
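The approach above leaves an empty string where nothing matches, whereas the question asks for False. One possible variation, written in plain Python rather than with `np.char`, is sketched below. The helper `first_ingredient` is a hypothetical name introduced for illustration. Like the NumPy version, it relies on substring matching, so plurals are only caught when the singular is a prefix of the plural ("egg" in "eggs"), and it would also match unrelated words that happen to contain an ingredient name.

```python
import pandas as pd

ingredients = pd.Series(["vanilla extract", "walnut", "oat", "egg", "almond"])
df = pd.DataFrame({"val": [
    "1 teaspoons vanilla extract", "2 eggs", "3 cups chopped walnuts",
    "4 cups rolled oats",
    "1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf", "fsfgsgsfgfg",
]})


def first_ingredient(phrase, names):
    """Return the first name contained in phrase (case-insensitive), else False."""
    text = phrase.lower()
    for name in names:
        if name.lower() in text:
            return name
    return False


# Loop over phrases (not ingredients) so every DataFrame row gets a value
df["existence"] = [first_ingredient(p, ingredients) for p in df["val"]]
print(df)
```

This is slower than the vectorised version on large inputs, but it returns exactly one value per row and makes the "otherwise return False" branch explicit.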