Input:
I have a DataFrame like the one shown below.
Output:
I need to derive a Remaining_name column for the data shown below.
Full_Name             Name1   Name2
John Mathew Davidson  John    Davidson
Paul Theodre Luther   Paul    Theodre
Victor George Mary    George  Mary
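Explanation:
I need to compare the values (words) of more than one column against the value (a sentence) in another column, and find the unmatched words, which may appear anywhere in the string.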
Answer 0 (score: 1)
A solution using replace, applied row by row:
df['Remaining_name'] = df.apply(lambda x: x['Full_Name'].replace(x['Name1'], '').replace(x['Name2'], ''), axis=1).str.strip()
              Full_Name   Name1     Name2  Remaining_name
0  John Mathew Davidson    John  Davidson          Mathew
1   Paul Theodre Luther    Paul   Theodre          Luther
2    Victor George Mary  George      Mary          Victor
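Edit: if there are many columns whose names start with "Name", you can select that slice of columns and replace their values in Full_Name using a regular-expression pattern:
# Build one alternation pattern per row from every column whose name starts with 'Name'
df['tmp'] = df[df.columns[df.columns.str.startswith('Name')]].apply('|'.join, axis=1)
# Remove every match of that pattern in the row, then keep the cleaned Full_Name
df['Remaining_name'] = df.apply(lambda x: x.replace(x['tmp'], '', regex=True), axis=1)['Full_Name'].str.strip()
# Drop the helper column
df.drop('tmp', axis=1, inplace=True)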
                     Full_Name   Name1     Name2    Remaining_name
0         John Mathew Davidson    John  Davidson            Mathew
1          Paul Theodre Luther    Paul   Theodre            Luther
2           Victor George Mary  George      Mary            Victor
3  Henry Patrick John Harrison   Henry      John  Patrick Harrison
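One caveat with the pattern-based edit: the names are used as raw regular expressions, so a name containing a metacharacter such as `.` or `(` would be misread, and a short name can also match inside a longer word. Below is a minimal sketch of a stricter variant, assuming the names are plain whitespace-separated words. The helper `remove_known_names` and the list `name_cols` are illustrative names, not part of the answer above.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Full_Name": ["John Mathew Davidson", "Paul Theodre Luther",
                  "Victor George Mary", "Henry Patrick John Harrison"],
    "Name1": ["John", "Paul", "George", "Henry"],
    "Name2": ["Davidson", "Theodre", "Mary", "John"],
})

# Assumption: every column whose label starts with "Name" holds a known name.
name_cols = [c for c in df.columns if c.startswith("Name")]


def remove_known_names(row):
    # Escape each name so regex metacharacters are matched literally,
    # and wrap the alternation in \b so only whole words are removed.
    alternation = "|".join(re.escape(row[c]) for c in name_cols)
    pattern = r"\b(?:" + alternation + r")\b"
    remaining = re.sub(pattern, " ", row["Full_Name"])
    # Collapse the whitespace left behind by the removed words.
    return " ".join(remaining.split())


df["Remaining_name"] = df.apply(remove_known_names, axis=1)
print(df)
```

Collapsing whitespace at the end also avoids the double space that a plain replace leaves behind when two removed names surround the remaining ones.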
Answer 1 (score: 1)
Here is the data you provided:
import pandas as pd
full_name = ['John Mathew Davidson', 'Paul Theodre Luther', 'Victor George Mary']
name_1 = ['John', 'Paul', 'George']
name_2 = ['Davidson', 'Theodre', 'Mary']
df = pd.DataFrame({'Full_Name':full_name, 'Name1':name_1, 'Name2':name_2 })
To operate on several columns of a row, it is best to define the function separately: this makes the code more readable and easier to debug. The function takes a DataFrame row as input:
def find_missing_name(row):
    known_names = [row['Name1'], row['Name2']]  # collect the known names so we can check against them later
    full_name_list = row['Full_Name'].split(' ')  # split the full name into a list of words on spaces
    # WARNING: this only works well if the items in 'Full_Name' are separated by a single space.
    missing_name = [x for x in full_name_list if x not in known_names]  # keep only the words that are not known names
    missing_name = ','.join(missing_name)  # if more than one name is missing, join them into one comma-separated string
    return missing_name
Now apply the function to the existing DataFrame:
df['missing_name'] = df.apply(find_missing_name, axis=1) ## axis=1 means 'apply to each row', where axis=0 means 'apply to each column'
Hope this helps :)
Answer 2 (score: 1)
You can do this in one line with the following code:
df['Remaining_name'] = df.apply(lambda x: [i for i in x['Full_Name'].split() if all(i not in x[c] for c in df.columns[1:])], axis=1)
This returns the Remaining_name column as a list, which is useful when a name contains more than three substrings, for example:
                     Full_Name    Name1     Name2    Remaining_name
0         John Mathew Davidson     John  Davidson          [Mathew]
1          Paul Theodre Luther     Paul   Theodre          [Luther]
2           Victor George Mary   George      Mary          [Victor]
3  Henry Patrick John Harrison  Patrick     Henry  [John, Harrison]
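Note that `i not in x[c]` is a substring test on strings, so a word such as "John" is also dropped when a known name is "Johnson". The sketch below contrasts that with an exact comparison and joins the result into a string instead of a list. The helper `remaining`, the `exact` flag, the explicit `name_cols` list, and the first sample row are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Full_Name": ["John Johnson Davidson", "Henry Patrick John Harrison"],
    "Name1": ["Johnson", "Patrick"],
    "Name2": ["Davidson", "Henry"],
})

# Naming the columns explicitly avoids picking up Remaining_name on a re-run.
name_cols = ["Name1", "Name2"]


def remaining(row, exact):
    words = row["Full_Name"].split()
    if exact:
        # Keep a word only if it differs from every known name.
        kept = [w for w in words if all(w != row[c] for c in name_cols)]
    else:
        # Substring test, as in the one-liner: "John" is found inside "Johnson".
        kept = [w for w in words if all(w not in row[c] for c in name_cols)]
    return " ".join(kept)


df["by_substring"] = df.apply(lambda r: remaining(r, exact=False), axis=1)
df["by_exact_match"] = df.apply(lambda r: remaining(r, exact=True), axis=1)
print(df[["by_substring", "by_exact_match"]])
```

For the first row the substring test leaves nothing, while the exact comparison keeps "John"; both versions agree on the second row.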
Answer 3 (score: 0)
Try this:
import numpy as np
In [835]: df
Out[835]:
              Full_name   Name1     Name2
0  John Mathew Davidson    John  Davidson
1   Paul Theodre Luther    Paul   Theodre
2    Victor George Mary  George      Mary
ll = []
In [854]: for i, r in df.iterrows():
     ...:     big_list = r[0].split(' ')
     ...:     l1 = [r[1]]
     ...:     l2 = [r[2]]
     ...:     remaining_item = np.setdiff1d(big_list, l1+l2)[0]
     ...:     ll.append(remaining_item)
In [856]: df['Remaining_name'] = ll
In [857]: df
Out[857]:
              Full_name   Name1     Name2  Remaining_name
0  John Mathew Davidson    John  Davidson          Mathew
1   Paul Theodre Luther    Paul   Theodre          Luther
2    Victor George Mary  George      Mary          Victor
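Because `np.setdiff1d` returns the sorted unique values and the loop keeps only element `[0]`, a row with several leftover words would yield just the first one in sorted order. Here is a minimal sketch of an order-preserving alternative that keeps every leftover word, assuming the same column names as this answer; the fourth sample row is added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Full_name": ["John Mathew Davidson", "Paul Theodre Luther",
                  "Victor George Mary", "Henry Patrick John Harrison"],
    "Name1": ["John", "Paul", "George", "Henry"],
    "Name2": ["Davidson", "Theodre", "Mary", "John"],
})

remaining = []
# zip over the columns instead of iterrows, so each value is accessed by name.
for full, n1, n2 in zip(df["Full_name"], df["Name1"], df["Name2"]):
    known = {n1, n2}
    # Keep the leftover words in their original order.
    remaining.append(" ".join(w for w in full.split() if w not in known))

df["Remaining_name"] = remaining
print(df)
```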