Question

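This is a subset of the DataFrame I have. For each row that has a value in the sentence column, the values in columns B, C, and D are repeated for the next two rows, which have no value in the sentence column. How can I drop the second row where sentence is null? I need to keep the first row with a null value in the sentence column.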

     A    B   C    D             R      sentence              ADR 
    112 135 21  EffexorXR.21    1    lack of good feeling.    good
    113 135 21  EffexorXR.21    1                               1 
    114 135 21  EffexorXR.21    1   
    115 136 21  EffexorXR.21    2   Feel disconnected         disconnected
    116 136 21  EffexorXR.21    2        
    117 136 21  EffexorXR.21    2    
    118 142 22  EffexorXR.22    1   Weight gain                gain
    119 142 22  EffexorXR.22    1                                1
    120 142 22  EffexorXR.22    1

The desired output looks like this:

   A    B   C    D             R        sentence               ADR     
    112 135 21  EffexorXR.21    1    lack of good feeling.     good
    113 135 21  EffexorXR.21    1                               1
    115 136 21  EffexorXR.21    2    Feel disconnected        disconnected       
    116 136 21  EffexorXR.21    2   
    118 142 22  EffexorXR.22    1    Weight gain               gain
    119 142 22  EffexorXR.22    1                               1

If I use the following code:

df = df[pd.notnull(df['sentence'])], it drops both of the rows with null values. Any suggestions?

The following solution does not work.

df.set_index('A').drop_duplicates().reset_index()

Answer 1

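You can use drop_duplicates. Column A is unique, so we set it as the index. drop_duplicates then checks the remaining columns for duplicates and drops any it finds. Finally, reset_index restores column A.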

df.set_index('A').drop_duplicates().reset_index()
Out[847]: 
     A    B   C             D  R               sentence
0  112  135  21  EffexorXR.21  1  lack of good feeling.
1  113  135  21  EffexorXR.21  1                       
2  115  136  21  EffexorXR.21  2      Feel disconnected
3  116  136  21  EffexorXR.21  2                       
4  118  142  22  EffexorXR.22  1            Weight gain
5  119  142  22  EffexorXR.22  1
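
For what it's worth, here is a minimal sketch of why the full-row check stops working once the ADR column from the question is included. The DataFrame is rebuilt by hand from the question, and it assumes the empty cells are missing values (None/NaN), which the question does not confirm.

```python
import pandas as pd

# Sample data rebuilt from the question.
# Assumption: empty cells are missing values (None/NaN).
df = pd.DataFrame({
    "A": [112, 113, 114, 115, 116, 117, 118, 119, 120],
    "B": [135, 135, 135, 136, 136, 136, 142, 142, 142],
    "C": [21, 21, 21, 21, 21, 21, 22, 22, 22],
    "D": ["EffexorXR.21"] * 6 + ["EffexorXR.22"] * 3,
    "R": [1, 1, 1, 2, 2, 2, 1, 1, 1],
    "sentence": ["lack of good feeling.", None, None,
                 "Feel disconnected", None, None,
                 "Weight gain", None, None],
    "ADR": ["good", "1", None, "disconnected", None, None, "gain", "1", None],
})

# Compare every column except A when looking for duplicates.
full_row = df.set_index("A").drop_duplicates().reset_index()
print(full_row["A"].tolist())
# [112, 113, 114, 115, 116, 118, 119, 120]
```

Rows 114 and 120 survive because their ADR value differs from the row above them, so the full rows are not identical. Only row 117 is an exact repeat of its predecessor.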

Updated answer: use only a subset of the columns as the key when checking for duplicates.

df.drop_duplicates(subset=['B','C','D','sentence'])
Out[866]: 
     A    B   C             D  R               sentence           ADR
0  112  135  21  EffexorXR.21  1  lack of good feeling.          good
1  113  135  21  EffexorXR.21  1                                    1
3  115  136  21  EffexorXR.21  2      Feel disconnected  disconnected
4  116  136  21  EffexorXR.21  2                                  nan
6  118  142  22  EffexorXR.22  1            Weight gain          gain
7  119  142  22  EffexorXR.22  1                                    1
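
A self-contained sketch of the subset approach, in case you want to run it yourself. The data is rebuilt by hand from the question and assumes the empty cells are missing values (None/NaN). pandas treats missing values as equal to each other when checking for duplicates, so the second empty row in each group counts as a repeat of the first.

```python
import pandas as pd

# Sample data rebuilt from the question.
# Assumption: empty cells are missing values (None/NaN).
df = pd.DataFrame({
    "A": [112, 113, 114, 115, 116, 117, 118, 119, 120],
    "B": [135, 135, 135, 136, 136, 136, 142, 142, 142],
    "C": [21, 21, 21, 21, 21, 21, 22, 22, 22],
    "D": ["EffexorXR.21"] * 6 + ["EffexorXR.22"] * 3,
    "R": [1, 1, 1, 2, 2, 2, 1, 1, 1],
    "sentence": ["lack of good feeling.", None, None,
                 "Feel disconnected", None, None,
                 "Weight gain", None, None],
    "ADR": ["good", "1", None, "disconnected", None, None, "gain", "1", None],
})

# Only B, C, D and sentence form the duplicate key; A and ADR are ignored.
# keep="first" is the default, so the first empty row of each group stays.
result = df.drop_duplicates(subset=["B", "C", "D", "sentence"])
print(result["A"].tolist())
# [112, 113, 115, 116, 118, 119]
```

The original index labels (0, 1, 3, 4, 6, 7) are kept, which matches the output above. Append .reset_index(drop=True) if you want them renumbered.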

Answer 2

You can flag duplicates across a combination of columns and use the result as a mask on the original DataFrame:

new_df = df[~df[['B','C','D', 'R', 'sentence']].duplicated()]
print(new_df)

Output:

     A    B   C             D  R               sentence           ADR
0  112  135  21  EffexorXR.21  1  lack of good feeling.          good
1  113  135  21  EffexorXR.21  1                                    1
3  115  136  21  EffexorXR.21  2      Feel disconnected  disconnected
4  116  136  21  EffexorXR.21  2                                     
6  118  142  22  EffexorXR.22  1            Weight gain          gain
7  119  142  22  EffexorXR.22  1                                    1
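
If the empty cells turn out to be empty strings rather than NaN (an assumption here, since the question does not say which), the mask should behave the same way. A rough sketch with hand-built data, printing the mask so you can see which rows get flagged:

```python
import pandas as pd

# Sample data rebuilt from the question.
# Assumption: empty cells are empty strings instead of NaN.
df = pd.DataFrame({
    "A": [112, 113, 114, 115, 116, 117, 118, 119, 120],
    "B": [135, 135, 135, 136, 136, 136, 142, 142, 142],
    "C": [21, 21, 21, 21, 21, 21, 22, 22, 22],
    "D": ["EffexorXR.21"] * 6 + ["EffexorXR.22"] * 3,
    "R": [1, 1, 1, 2, 2, 2, 1, 1, 1],
    "sentence": ["lack of good feeling.", "", "",
                 "Feel disconnected", "", "",
                 "Weight gain", "", ""],
    "ADR": ["good", "1", "", "disconnected", "", "", "gain", "1", ""],
})

key_cols = ["B", "C", "D", "R", "sentence"]

# True where the key columns repeat an earlier row.
mask = df[key_cols].duplicated()
print(mask.tolist())
# [False, False, True, False, False, True, False, False, True]

# Keep only the rows that are not flagged.
new_df = df[~mask]
print(new_df["A"].tolist())
# [112, 113, 115, 116, 118, 119]
```

This is equivalent to df.drop_duplicates(subset=key_cols); the explicit mask is mainly handy when you want to inspect or combine the condition before filtering.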

How to drop specific rows with null values
