Fill missing values in another column when the last names in the Name column match

Time: 2016-10-15 20:09:28

Tags: pandas iteration

Below is a sample from a larger DataFrame.

       Fare      Cabin  Pclass  Ticket  Name
257     86.5000     B77     1   110152  Cherry, Miss. Gladys
759     86.5000     B77     1   110152  Rothes, the Countess. of (Lucy Noel Martha Dye...
504     86.5000     B79     1   110152  Maioni, Miss. Roberta
262     79.6500     E67     1   110413  Taussig, Mr. Emil
558     79.6500     E67     1   110413  Taussig, Mrs. Emil (Tillie Mandelbaum)
585     79.6500     NaN     1   110413  Taussig, Miss. Ruth
475     52.0000     A14     1   110465  Clifford, Mr. George Quincy
110     52.0000     C110    1   110465  Porter, Mr. Walter Chamberlain
335     26.0000     C106    1   110469  Maguire, Mr. John Edward
158     26.5500     D22     1   110489  Borebank, Mr. John James
430     26.5500     C52     1   110564  Bjornstrom-Steffansson, Mr. Mauritz Hakan
236     75.2500     D37     1   110813  Warren, Mr. Frank Manley
366     75.2500     D37     1   110813  Warren, Mrs. Frank Manley (Anna Sophia Atkinson)
191     26.0000     NaN     1   111163  Salomon, Mr. Abraham L
170     33.5000     B19     1   111240  Van der hoef, Mr. Wyckoff
462     38.5000     E63     1   111320  Gee, Mr. Arthur H
329     57.9792     NaN     1   111361  Hippach, Miss. Jean Gertrude
523     57.9792     B18     1   111361  Hippach, Mrs. Louis Albert (Ida Sophia Fischer)

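I want to fill in the missing "Cabin" value for a passenger using another passenger's "Cabin" value, but only if that other passenger (the one who has a Cabin value) has the same last name and is also adjacent to them (the row directly above or directly below).

So in the DataFrame above, the NaN Cabin value of [Taussig, Miss. Ruth] would be replaced with E67, the Cabin value of [Taussig, Mrs. Emil], because that row is directly above hers and both conditions are met (same last name, adjacent).

Likewise, the missing Cabin value of [Hippach, Miss. Jean Gertrude] would be replaced with B18, the Cabin value of [Hippach, Mrs. Louis Albert (Ida Sophia Fischer)].

I tried to think of a way to iterate, but this is as far as I got: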

for x in df.Name.str.split(',')[x][0] ==df.Name.str.split(',')[x+1][0]:
    if df.Cabin[x] or df.Cabin[x+1] == np.nan:
      df.Cabin.replace(np.nan, 

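I want to make sure the np.nan value is replaced with the actual Cabin value and not with another np.nan. I can't figure out how to do that.

Thanks.

2 Answers:

Answer 0: (score: 3)

Starting with your DataFrame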

print(df)    
       Fare     Cabin  Pclass  Ticket  \
    0   86.5000       B77       1  110152   
    1   86.5000       B77       1  110152   
    2   86.5000       B79       1  110152   
    3   79.6500       E67       1  110413   
    4   79.6500       E67       1  110413   
    5   79.6500       NaN       1  110413   
    6   52.0000       A14       1  110465   
    7   52.0000      C110       1  110465   
    8   26.0000      C106       1  110469   
    9   26.5500       D22       1  110489   
    10  26.5500       C52       1  110564   
    11  75.2500       D37       1  110813   
    12  75.2500       D37       1  110813   
    13  26.0000       NaN       1  111163   
    14  33.5000       B19       1  111240   
    15  38.5000       E63       1  111320   
    16  57.9792       NaN       1  111361   
    17  57.9792       B18       1  111361   

                                                     Name  
    0                                Cherry, Miss. Gladys  
    1   Rothes, the Countess. of (Lucy Noel Martha Dye...  
    2                               Maioni, Miss. Roberta  
    3                                   Taussig, Mr. Emil  
    4              Taussig, Mrs. Emil (Tillie Mandelbaum)  
    5                                 Taussig, Miss. Ruth  
    6                         Clifford, Mr. George Quincy  
    7                      Porter, Mr. Walter Chamberlain  
    8                            Maguire, Mr. John Edward  
    9                            Borebank, Mr. John James  
    10          Bjornstrom-Steffansson, Mr. Mauritz Hakan  
    11                           Warren, Mr. Frank Manley  
    12   Warren, Mrs. Frank Manley (Anna Sophia Atkinson)  
    13                             Salomon, Mr. Abraham L  
    14                          Van der hoef, Mr. Wyckoff  
    15                                  Gee, Mr. Arthur H  
    16                       Hippach, Miss. Jean Gertrude  
    17    Hippach, Mrs. Louis Albert (Ida Sophia Fischer) 

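Create a new column/Series containing only the last name. Note that there is probably a better way to do this with the pandas str methods, but I couldn't get any of them to work.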

df['LastName'] = df['Name'].map(lambda x : x[:x.find(',')]) 
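
For reference, the same extraction can probably be written with the pandas `.str` accessor. This is only a minimal sketch: the three-row `names` Series is a cut-down illustration rather than the full data, and `last_name` is a name chosen here.

```python
import pandas as pd

# Illustrative sample taken from the Name column; not the full dataset
names = pd.Series([
    "Taussig, Mr. Emil",
    "Van der hoef, Mr. Wyckoff",
    "Hippach, Miss. Jean Gertrude",
])

# Split on the first comma only and keep the part before it
last_name = names.str.split(",", n=1).str[0].str.strip()
print(last_name.tolist())  # ['Taussig', 'Van der hoef', 'Hippach']
```

One difference worth knowing: for a name without a comma, `x[:x.find(',')]` drops the last character (because `find` returns -1), whereas the split version returns the whole string.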

Then we use pandas' shift and boolean indexing to check whether the passenger in the row above has the same last name (i.e. the Taussig case)

# rows with a missing Cabin whose last name matches the row above
mask = (df['Cabin'].isnull()) & (df['LastName'] == df['LastName'].shift())
df.loc[mask, 'Cabin'] = df['Cabin'].shift()
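
To make the role of `shift` concrete, here is a small sketch on a three-row Series that mimics the Taussig rows. The names `s`, `above` and `below` are chosen for this illustration only.

```python
import pandas as pd

# Cabin values for the three Taussig rows (index labels 3, 4, 5)
s = pd.Series(["E67", "E67", None], index=[3, 4, 5])

# shift() moves values down one row: each label now holds the value from the row above
above = s.shift()

# shift(-1) moves values up one row: each label now holds the value from the row below
below = s.shift(-1)

print(above.loc[5])  # E67, the value that sat in row 4
print(below.loc[3])  # E67, the value that sat in row 4
```

Because the shifted Series keeps the original index labels, comparing or assigning it against the unshifted column lines each row up with its neighbour.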

Then, for the passenger in the row below, pass -1 to shift() (i.e. the Hippach case)

# rows with a missing Cabin whose last name matches the row below
mask = (df['Cabin'].isnull()) & (df['LastName'] == df['LastName'].shift(-1))
df.loc[mask, 'Cabin'] = df['Cabin'].shift(-1)

print(df)
       Fare     Cabin  Pclass  Ticket  \
0   86.5000       B77       1  110152   
1   86.5000       B77       1  110152   
2   86.5000       B79       1  110152   
3   79.6500       E67       1  110413   
4   79.6500       E67       1  110413   
5   79.6500       E67       1  110413   
6   52.0000       A14       1  110465   
7   52.0000      C110       1  110465   
8   26.0000      C106       1  110469   
9   26.5500       D22       1  110489   
10  26.5500       C52       1  110564   
11  75.2500       D37       1  110813   
12  75.2500       D37       1  110813   
13  26.0000       NaN       1  111163   
14  33.5000       B19       1  111240   
15  38.5000       E63       1  111320   
16  57.9792       B18       1  111361   
17  57.9792       B18       1  111361   

                                                 Name                LastName  
0                                Cherry, Miss. Gladys                  Cherry  
1   Rothes, the Countess. of (Lucy Noel Martha Dye...                  Rothes  
2                               Maioni, Miss. Roberta                  Maioni  
3                                   Taussig, Mr. Emil                 Taussig  
4              Taussig, Mrs. Emil (Tillie Mandelbaum)                 Taussig  
5                                 Taussig, Miss. Ruth                 Taussig  
6                         Clifford, Mr. George Quincy                Clifford  
7                      Porter, Mr. Walter Chamberlain                  Porter  
8                            Maguire, Mr. John Edward                 Maguire  
9                            Borebank, Mr. John James                Borebank  
10          Bjornstrom-Steffansson, Mr. Mauritz Hakan  Bjornstrom-Steffansson  
11                           Warren, Mr. Frank Manley                  Warren  
12   Warren, Mrs. Frank Manley (Anna Sophia Atkinson)                  Warren  
13                             Salomon, Mr. Abraham L                 Salomon  
14                          Van der hoef, Mr. Wyckoff            Van der hoef  
15                                  Gee, Mr. Arthur H                     Gee  
16                       Hippach, Miss. Jean Gertrude                 Hippach  
17    Hippach, Mrs. Louis Albert (Ida Sophia Fischer)                 Hippach 
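
The two steps above can be folded into one helper. The following is a self-contained sketch under a few assumptions: it uses a six-row excerpt of the sample data, `fill_from_neighbor` is a hypothetical name introduced here, and the last name is kept in a separate Series rather than a `LastName` column.

```python
import numpy as np
import pandas as pd

# Six-row excerpt of the sample data (Cabin and Name only)
df = pd.DataFrame({
    "Cabin": ["E67", "E67", np.nan, np.nan, np.nan, "B18"],
    "Name": [
        "Taussig, Mr. Emil",
        "Taussig, Mrs. Emil (Tillie Mandelbaum)",
        "Taussig, Miss. Ruth",
        "Salomon, Mr. Abraham L",
        "Hippach, Miss. Jean Gertrude",
        "Hippach, Mrs. Louis Albert (Ida Sophia Fischer)",
    ],
})

last_name = df["Name"].str.split(",", n=1).str[0]

def fill_from_neighbor(frame, surname, periods):
    """Hypothetical helper: fill a missing Cabin from the row `periods` away
    (1 = row above, -1 = row below) when the last names match."""
    mask = frame["Cabin"].isna() & (surname == surname.shift(periods))
    # The assignment aligns on the index, so only the masked rows are written
    frame.loc[mask, "Cabin"] = frame["Cabin"].shift(periods)

fill_from_neighbor(df, last_name, 1)    # look at the row above
fill_from_neighbor(df, last_name, -1)   # look at the row below
print(df["Cabin"].tolist())
```

In this excerpt Ruth Taussig receives E67 from the row above, Jean Hippach receives B18 from the row below, and Salomon stays missing because neither neighbour shares the last name.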

Answer 1: (score: 2)

groupby + fillna

# back fills, then forward fills
def bffill(x):
    return x.bfill().ffill()

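# group by last name; transform returns a result on the original index,
# so it assigns back cleanly (apply can prepend the group keys in newer pandas)
df['Cabin'] = df.groupby(df.Name.str.split(',').str[0]).Cabin.transform(bffill)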

df
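
One thing to keep in mind with the groupby approach: it fills within each last-name group no matter where the rows sit, so it does not enforce the "directly above or below" condition from the question. A small sketch of that behaviour, where the two "Doe" rows are made-up placeholders and `filled` is a name chosen here:

```python
import numpy as np
import pandas as pd

# The two "Doe" rows are hypothetical placeholders, not from the dataset
df = pd.DataFrame({
    "Cabin": ["C23", np.nan, np.nan, "B18"],
    "Name": [
        "Doe, Mr. A",
        "Hippach, Miss. Jean Gertrude",
        "Doe, Miss. B",
        "Hippach, Mrs. Louis Albert (Ida Sophia Fischer)",
    ],
})

last_name = df["Name"].str.split(",", n=1).str[0]

# Back fill, then forward fill, within each last-name group
filled = df.groupby(last_name)["Cabin"].transform(lambda s: s.bfill().ffill())
print(filled.tolist())  # ['C23', 'B18', 'C23', 'B18']
```

Both missing values are filled here even though the matching rows are two positions apart. If strict adjacency matters, the shift-based approach is likely the closer fit.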
