Update a column based on other columns

Time: 2018-05-18 20:17:41

Tags: python pandas

My grasp of pandas is weak, and I don't have a strong understanding of Python.

I want to update a column (d.Alias) based on the values of existing columns (d2.Alias and d.Company). If d2.Alias is a substring of d.Company, then d.Alias should equal d2.Alias.

Example dataset:

import numpy as np

d = {'Company': ['The Cool Company Inc', 'Cool Company, Inc',
                 'The Cool Company', 'The Shoe Company',
                 'Muffler Store', 'Muffler Store'],
     'Position': ['Cool Job A', 'Cool Job B', 'Cool Job C', 'Salesman',
                  'Sales', 'Technician'],
     'City': ['Tacoma', 'Tacoma', 'Tacoma', 'Boulder', 'Chicago', 'Chicago'],
     'State': ['AZ', 'AZ', 'AZ', 'CO', 'IL', 'IL'],
     'Alias': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
d2 = {'Company': ['The Cool Company, Inc.', 'The Shoe Company',
                  'Muffler Store LLC'],
      'Alias': ['Cool Company', np.nan, 'Muffler'],
      'First Name': ['Carol', 'James', 'Frankie'],
      'Last Name': ['Fisher', 'Smith', 'Johnson']}

The np.nan for The Shoe Company is there because no alias is needed for that instance.

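I have tried .loc, for loops, while loops, pandas.where, numpy.where, and several variations of each, without getting the result I want. With a for loop, the last value of d2.Alias was copied to every row of d.Alias. However, I have not been able to reproduce that.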

Previous posts I have looked at, which I either could not get to work or did not understand: "Conditionally fill column with value from another DataFrame based on row match in Pandas" and "pandas create new column based on values from other columns".

Any help is greatly appreciated!

Edit:

Expected output

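Update:
After a few days of tinkering, I got the result I wanted. I had to change a few things in Wen's answer.

First, I created a list of the df2.Alias values, named aliases:
aliases = df2.Alias.unique()

Then I had to remove …. The line that produced the result I wanted:
…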

2 answers:

Answer 0: (score: 2)

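One approach is to iterate over d2, which is probably the much smaller DataFrame, check where each alias is a substring of d.Company, and set d.Alias to that alias on the matching rows.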

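import pandas as pd

d = pd.DataFrame(d)
d2 = pd.DataFrame(d2)
# The all-NaN Alias column is float; cast it so it can hold strings.
d['Alias'] = d['Alias'].astype(object)

# For each non-null alias in d2, tag every row of d whose Company contains it.
# regex=False treats the alias as a literal substring, not a regular expression.
for row in d2[d2.Alias.notnull()].itertuples():
    d.loc[d.Company.str.contains(row.Alias, regex=False), 'Alias'] = row.Alias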

print(d)
#          Alias     City               Company    Position State
#0  Cool Company   Tacoma  The Cool Company Inc  Cool Job A    AZ
#1  Cool Company   Tacoma     Cool Company, Inc  Cool Job B    AZ
#2  Cool Company   Tacoma      The Cool Company  Cool Job C    AZ
#3           NaN  Boulder      The Shoe Company    Salesman    CO
#4       Muffler  Chicago         Muffler Store       Sales    IL
#5       Muffler  Chicago         Muffler Store  Technician    IL
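
A loop-free variant of the same substring idea can be sketched with Series.str.extract. This assumes the aliases are plain text (they are escaped before being joined into one pattern) and that each company name contains at most one alias. If several aliases occur in one name, the leftmost one in the string is kept, which differs from the loop above, where the last alias in d2 wins. The variable names aliases and pattern are only illustrative.

```python
import re

import numpy as np
import pandas as pd

d = pd.DataFrame({'Company': ['The Cool Company Inc', 'Cool Company, Inc',
                              'The Cool Company', 'The Shoe Company',
                              'Muffler Store', 'Muffler Store']})
d2 = pd.DataFrame({'Alias': ['Cool Company', np.nan, 'Muffler']})

# Unique, non-null aliases; longest first so that a longer alias is preferred
# over a shorter one starting at the same position.
aliases = sorted(d2['Alias'].dropna().unique(), key=len, reverse=True)

# One capture group that matches any alias literally.
pattern = '(' + '|'.join(re.escape(a) for a in aliases) + ')'

# Rows with no alias in Company get NaN.
d['Alias'] = d['Company'].str.extract(pattern, expand=False)
print(d)
```

On large frames this avoids one pass over d per alias, at the cost of building a single combined pattern.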

Answer 1: (score: 2)

A solution using fuzzywuzzy:
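from fuzzywuzzy import process

# df1 and df2 are presumably the DataFrames built from d and d2 in the question.
# For each company in df1, take the closest df2.Company (fuzzy match),
# then map that company name to its alias.
df1['Alias'] = df1.Company.apply(lambda x: [process.extract(x, df2.Company, limit=1)][0][0][0]).map(df2.set_index('Company').Alias)
df1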
Out[31]: 
          Alias     City               Company    Position State
0  Cool Company   Tacoma  The Cool Company Inc  Cool Job A    AZ
1  Cool Company   Tacoma     Cool Company, Inc  Cool Job B    AZ
2  Cool Company   Tacoma      The Cool Company  Cool Job C    AZ
3           NaN  Boulder      The Shoe Company    Salesman    CO
4       Muffler  Chicago         Muffler Store       Sales    IL
5       Muffler  Chicago         Muffler Store  Technician    IL
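
If installing fuzzywuzzy is not an option, the same match-then-map idea can be sketched with the standard library's difflib. This swaps in a different similarity measure (SequenceMatcher's ratio rather than fuzzywuzzy's scorer), so results on other data may differ. The helper closest_company and the 0.6 cutoff are illustrative choices, not part of the answer above.

```python
from difflib import get_close_matches

import numpy as np
import pandas as pd

d = pd.DataFrame({'Company': ['The Cool Company Inc', 'Cool Company, Inc',
                              'The Cool Company', 'The Shoe Company',
                              'Muffler Store', 'Muffler Store']})
d2 = pd.DataFrame({'Company': ['The Cool Company, Inc.', 'The Shoe Company',
                               'Muffler Store LLC'],
                   'Alias': ['Cool Company', np.nan, 'Muffler']})


def closest_company(name, choices, cutoff=0.6):
    """Return the entry of choices most similar to name, or None below cutoff."""
    matches = get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


choices = d2['Company'].tolist()
alias_by_company = d2.set_index('Company')['Alias']

# Step 1: find the closest d2.Company for each d.Company.
# Step 2: look up that company's alias (NaN where d2 has no alias).
d['Alias'] = (d['Company']
              .apply(lambda name: closest_company(name, choices))
              .map(alias_by_company))
print(d)
```

Raising the cutoff makes the match stricter; companies with no sufficiently similar entry in d2 end up with NaN.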