Comparing a Series of strings against a list of strings and getting the substring matches

Time: 2020-06-17 04:30:20

Tags: python pandas string dataframe comparison

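I want to check whether the strings in a DataFrame column contain, as substrings, any of the values in a column of another DataFrame.

In my example, I want DF2['Column y'] to contain:

  • 'manager' for 'Software Developer Manager'
  • 'executive' for 'Online Bidding Executive', and so on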

DF1

  unique_values  counts  Rank  Stop_Word
0       manager    9322   1.0      False
1           for    8463   2.0       True
2     developer    7323   3.0      False
3     executive    5864   4.0      False
4      engineer    5669   5.0      False
5         sales    4492   6.0      False

DF2

                                           ColumnX   Column y
0                           Digital Media Planner.        NaN
1                        Online Bidding Executive.  Executive
2                       Software Developer Manager    Manager
3                               Technical Support.        NaN
4               Software Test Engineer -hyderabad.   engineer
5          Opening For Adobe Analytics Specialist.        NaN
6  Sales- Fresher-for Leading Property Consultant.        NaN
7           Opportunity For Azure Devops Architect        NaN
8                                             BDE.        NaN
9              Technical Support/ Product Support.        NaN

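I want DF2['Column y'] as the output.

Also, if more than one substring matches, the one with the lowest rank must be used, as at index 2 of DF2, where the title contains both 'manager' and 'developer'.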
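
To make the tie-breaking rule concrete, here is a minimal sketch in plain Python. The helper name `best_keyword` and the `ranks` dictionary are hypothetical, made up for illustration; only the keywords and rank values come from DF1 above.

```python
def best_keyword(title, ranked_keywords):
    """Return the lowest-ranked keyword that occurs in title, or None."""
    lowered = title.lower()
    # Collect (rank, keyword) pairs for every keyword found in the title.
    matches = [(rank, word) for word, rank in ranked_keywords.items()
               if word in lowered]
    # min() compares the rank first, so the lowest rank wins.
    return min(matches)[1] if matches else None

# Hypothetical subset of DF1: keyword -> Rank
ranks = {'manager': 1.0, 'developer': 3.0, 'executive': 4.0}
print(best_keyword('Software Developer Manager', ranks))
```

Both 'developer' and 'manager' occur in that title, and 'manager' is returned because its rank is lower.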

1 answer:

Answer 0: (score: 0)

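I would use apply; apply is essentially a map that calls a function on every row or column. The result can be stored in a separate column, as shown in the output.

Building the dataframes...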
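
As a small illustration of the two flavours of apply, here is a sketch on a toy frame. The `demo` frame and its column names are assumptions made up for this example and are not part of the question's data.

```python
import pandas as pd

# Toy frame used only for this illustration.
demo = pd.DataFrame({'title': ['Software Developer Manager', 'BDE']})

# Series.apply calls the function once per value in the column.
demo['length'] = demo['title'].apply(len)

# DataFrame.apply with axis=1 calls the function once per row.
demo['summary'] = demo.apply(
    lambda row: f"{row['title']} ({row['length']})", axis=1)
print(demo)
```

When only one column is needed, applying on that column directly is usually simpler than passing whole rows with axis=1.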

import pandas as pd
import re

df1 = {'unique_values': ['manager', 'for', 'developer', 'executive', 'engineer', 'sales'],
       'counts': [9322, 8463, 7323, 5864, 5669, 4492],
       'Rank': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
       'Stop_Word': [False, True, False, False, False, False]}
df1 = pd.DataFrame.from_dict(df1)

df2 = {'X': ['Digital Media Planner',
            'Online Bidding Executive',
            'Software Developer Manager',
            'Technical Support',
            'Software Test Engineer -hyderabad.Software Test Engineer -hyderabad',
            'Opening For Adobe Analytics Specialist.',
            'Sales- Fresher-for Leading Property Consultant.',
            'Opportunity For Azure Devops Architect',
            'BDE',
            'Technical Support/ Product Support.']}
df2 = pd.DataFrame.from_dict(df2)

The solution...

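def method(df1, df2_value):
    # Assumes df1 is already sorted by Rank, so the first hit has the lowest Rank.
    for row_index in range(len(df1)):
        df1_value = df1.iloc[row_index, 0]  # keyword (unique_values column)
        stop_word = df1.iloc[row_index, 3]  # stop-word flag (fourth column)

        # re.escape makes the keyword match literally rather than as a regex.
        if re.search(re.escape(df1_value), df2_value, re.IGNORECASE):
            # If the lowest-ranked match is a stop word, report no match.
            return None if stop_word else df1_value

    # No keyword occurs in the title.
    return None

# Apply to column X only; each title is passed to method() as a plain string.
df2['Y'] = df2['X'].apply(lambda title: method(df1, title))
print(df2)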

Output:

                                                X          Y
0                              Digital Media Planner       None
1                           Online Bidding Executive  executive
2                         Software Developer Manager    manager
3                                  Technical Support       None
4  Software Test Engineer -hyderabad.Software Tes...   engineer
5            Opening For Adobe Analytics Specialist.       None
6    Sales- Fresher-for Leading Property Consultant.       None
7             Opportunity For Azure Devops Architect       None
8                                                BDE       None
9                Technical Support/ Product Support.       None
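
For larger frames, one possible alternative is to let pandas find all keywords in one pass and then pick the lowest-ranked one. This is only a sketch, not the method used above: the names `rank_of`, `stop_words`, `pick` and `titles` are made up for illustration, and the data is a shortened copy of the example. Because regex alternation returns non-overlapping matches, it assumes no keyword overlaps another inside a title.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'unique_values': ['manager', 'for', 'developer', 'executive', 'engineer', 'sales'],
    'Rank': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Stop_Word': [False, True, False, False, False, False],
})
titles = pd.Series(['Online Bidding Executive',
                    'Software Developer Manager',
                    'Opportunity For Azure Devops Architect',
                    'BDE'])

# Lookup tables built once instead of scanning df1 for every title.
rank_of = dict(zip(df1['unique_values'], df1['Rank']))
stop_words = set(df1.loc[df1['Stop_Word'], 'unique_values'])

# One alternation pattern; re.escape keeps each keyword literal.
pattern = '|'.join(re.escape(word) for word in df1['unique_values'])

def pick(found):
    """Return the lowest-ranked keyword in found, or None for no match or a stop word."""
    if not found:
        return None
    best = min((word.lower() for word in found), key=rank_of.get)
    return None if best in stop_words else best

# str.findall returns, for each title, the list of keywords it contains.
labels = titles.str.findall(pattern, flags=re.IGNORECASE).apply(pick)
print(labels)
```

As in the answer above, a title whose lowest-ranked match is a stop word (here 'for') gets no label.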