Return values if a value in a column is found in another dataframe's column (pandas)

Time: 2021-02-11 21:27:35

Tags: python pandas dataframe filtering

I have two dfs. df1:

              Summary
0        This is a basket of red apples.
1        We found a bushel of fruit. They are red.
2        There is a peck of pears that taste sweet.
3        We have a box of plums.
4        This is bag of green apples.

df2:

      Fruits        
0    plum     
1    pear     
2    apple     
3    orange

I would like the output to be:

df2:

      Fruits     Summary   
0    plum        We have a box of plums.
1    pear        There is a peck of pears that taste sweet.
2    apple       This is a basket of red apples, This is bag of green apples
3    orange

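Simply put, if a fruit is found in a Summary, the matching Summary value should be returned; otherwise nothing or NaN.

Edit: if multiple instances are found, all of them should be returned, separated by commas.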

1 answer:

Answer 0: (score: 1)

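  • I think finding the unique fruits in each sentence is faster than finding every sentence for each fruit.
    • Finding every sentence for each fruit requires iterating through every sentence, for every fruit.
    • Presumably there are fewer unique fruits than sentences, so it is faster to find the fruits in each sentence.
    • The relative speed of the two approaches is an assumption and has not been tested.
  • For each 'Summary', add all the found 'Fruits' to a list, because a sentence can contain more than one fruit.
  • Explode the lists into separate rows.
  • Merge df1 to df2.
  • Group by 'Fruits' and combine the sentences for each fruit into a comma-separated string.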
import pandas as pd

# sample dataframes
df1 = pd.DataFrame({'Summary': ['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red.', 'There is a peck of pears that taste sweet.', 'We have a box of plums.', 'This is bag of green apples.', 'We have apples and pears']})

df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})

# display(df1)
                                          Summary
0  This is a basket of red apples. They are sour.
1       We found a bushel of fruit. They are red.
2      There is a peck of pears that taste sweet.
3                         We have a box of plums.
4                    This is bag of green apples.
5                        We have apples and pears

# set all values to lowercase in Fruits
df2.Fruits = df2.Fruits.str.lower()

# create an array of unique Fruits from df2
unique_fruits = df2.Fruits.unique()

# for each sentence check if a fruit is in the sentence and create a list
df1['Fruits'] = df1.Summary.str.lower().apply(lambda x: [v for v in unique_fruits if v in x])

# explode the lists into separate rows; if sentences contain more than one fruit, there will be more than one row
df1 = df1.explode('Fruits').reset_index(drop=True)

# merge df1 to df2
df2_ = df2.merge(df1, on='Fruits', how='left')

# groupby fruit, into a string
df2_ = df2_.groupby('Fruits').Summary.agg(list).str.join(', ').reset_index()

# display(df2_)
   Fruits                                                                                                 Summary
0   apple  This is a basket of red apples. They are sour., This is bag of green apples., We have apples and pears
1  orange                                                                                                     NaN
2    pear                                    There is a peck of pears that taste sweet., We have apples and pears
3    plum                                                                                 We have a box of plums.
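
The output above comes back sorted by fruit name, and the steps overwrite df1 along the way. As a minimal sketch, the same steps can be wrapped in a function that leaves both inputs untouched and keeps the row order of df2, as in the output requested in the question. The function name `summaries_per_fruit` is hypothetical, and the sample data is the five-sentence df1 from the question; the only change to the logic is `sort=False` in `groupby`.

```python
import pandas as pd


def summaries_per_fruit(sentences, fruits):
    """Hypothetical helper: one row per fruit, with matching sentences joined by ', '."""
    # lowercase the fruit names on a copy, so the caller's frame is not modified
    keys = fruits.assign(Fruits=fruits['Fruits'].str.lower())
    unique_fruits = keys['Fruits'].unique()

    # list the fruits found in each sentence, then give each fruit its own row
    matched = sentences.assign(
        Fruits=sentences['Summary'].str.lower().apply(
            lambda text: [fruit for fruit in unique_fruits if fruit in text]
        )
    ).explode('Fruits')

    # the left merge keeps every fruit, including those with no matching sentence
    merged = keys.merge(matched, on='Fruits', how='left')

    # sort=False keeps the groups in order of first appearance, i.e. the order of df2
    return (
        merged.groupby('Fruits', sort=False)['Summary']
        .agg(list)
        .str.join(', ')
        .reset_index()
    )


df1 = pd.DataFrame({'Summary': [
    'This is a basket of red apples.',
    'We found a bushel of fruit. They are red.',
    'There is a peck of pears that taste sweet.',
    'We have a box of plums.',
    'This is bag of green apples.',
]})
df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})

result = summaries_per_fruit(df1, df2)
print(result)
```

Because the match is a plain substring test, 'apple' would also match a sentence about 'pineapples'; that behaviour is the same as in the answer's code.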

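Alternative

  • As mentioned, my assumption is that this is the slower option, even though it is less code, because it iterates through every sentence for every fruit.
    • Run it against the original df1: after the explode step above, df1 holds one row per sentence and fruit, so a sentence that mentions two fruits would be listed twice.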
df2['Summary'] = df2.Fruits.str.lower().apply(lambda x: ', '.join([v for v in df1.Summary if x in v.lower()]))
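
Since the speed comparison is stated as an untested assumption, one way to check it is to time both approaches on the same data. The following is a rough sketch, not a benchmark: the function names `per_sentence` and `per_fruit` are hypothetical, both run on fresh copies of the sample frames, and six rows are far too few to say anything about real data, so substitute your own frames before drawing conclusions. The per-fruit one-liner returns an empty string when nothing matches; the sketch converts that to NaN so the two results are comparable.

```python
import timeit

import pandas as pd

sentences = pd.DataFrame({'Summary': [
    'This is a basket of red apples. They are sour.',
    'We found a bushel of fruit. They are red.',
    'There is a peck of pears that taste sweet.',
    'We have a box of plums.',
    'This is bag of green apples.',
    'We have apples and pears',
]})
fruits = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})


def per_sentence(sentences, fruits):
    # hypothetical name: find fruits per sentence, explode, merge, group
    keys = fruits.assign(Fruits=fruits['Fruits'].str.lower())
    unique_fruits = keys['Fruits'].unique()
    found = sentences['Summary'].str.lower().apply(
        lambda text: [fruit for fruit in unique_fruits if fruit in text]
    )
    exploded = sentences.assign(Fruits=found).explode('Fruits')
    merged = keys.merge(exploded, on='Fruits', how='left')
    return (
        merged.groupby('Fruits', sort=False)['Summary']
        .agg(list)
        .str.join(', ')
        .reset_index()
    )


def per_fruit(sentences, fruits):
    # hypothetical name: for each fruit, scan every sentence
    out = fruits.copy()
    out['Summary'] = out['Fruits'].str.lower().apply(
        lambda fruit: ', '.join(s for s in sentences['Summary'] if fruit in s.lower())
    )
    # an empty string means no sentence matched; turn it into a missing value
    out['Summary'] = out['Summary'].mask(out['Summary'] == '')
    return out


for approach in (per_sentence, per_fruit):
    seconds = timeit.timeit(lambda: approach(sentences, fruits), number=20)
    print(f'{approach.__name__}: {seconds:.4f} s for 20 runs')
```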