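Replace certain strings with NaN in pandas

Time: 2018-12-14 16:54:59

Tags: python pandas

I have a pandas DataFrame in which I need to go through the values in two columns (Location and Event) and replace the strings "Gate-3" and "NO Access" with NaN.

Below is a sample DataFrame.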

Time        Location    Event               Badge ID
18:28:59    Gate-2      Access Granted      81002
18:28:12    Gate-1      Access Granted      80557
18:27:55    Gate-3      Access Granted      80557
18:27:44    Gate-3      NO Access           80398
18:25:38    Gate-1      NO Access           80978
18:25:30    Gate-2      Access Granted      73680
18:23:56    Gate-1      Access Granted      73680
18:23:52    Gate-2      Access Granted      80557
18:23:19    Gate-2      NO Access           128
18:23:16    Gate-1      Access Granted      80557

The expected output is:

       Time Location           Event  Badge ID
0  18:28:59   Gate-2  Access Granted     81002
1  18:28:12   Gate-1  Access Granted     80557
2  18:27:55      NaN  Access Granted     80557
3  18:27:44      NaN             NaN     80398
4  18:25:38   Gate-1             NaN     80978
5  18:25:30   Gate-2  Access Granted     73680
6  18:23:56   Gate-1  Access Granted     73680
7  18:23:52   Gate-2  Access Granted     80557
8  18:23:19   Gate-2             NaN       128
9  18:23:16   Gate-1  Access Granted     80557

4 answers:

Answer 0: (score: 2)

You can do this while loading the XLS file by specifying the na_values parameter.

import pandas as pd

df = pd.read_excel('file.xls', na_values=['Gate-3', 'NO Access'])
print(df)

       Time Location           Event  Badge ID
0  18:28:59   Gate-2  Access Granted     81002
1  18:28:12   Gate-1  Access Granted     80557
2  18:27:55      NaN  Access Granted     80557
3  18:27:44      NaN             NaN     80398
4  18:25:38   Gate-1             NaN     80978
5  18:25:30   Gate-2  Access Granted     73680
6  18:23:56   Gate-1  Access Granted     73680
7  18:23:52   Gate-2  Access Granted     80557
8  18:23:19   Gate-2             NaN       128
9  18:23:16   Gate-1  Access Granted     80557

IMO, this is better than cleaning the data after loading it.
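Note that a plain list passed to na_values applies to every column. If the sentinels should only count in specific columns, na_values also accepts a dict keyed by column name. Here is a minimal sketch of that; it assumes the data is available as CSV text (the inline csv_text is a made-up subset of the question's sample) so that it runs without an Excel file. read_excel takes the same na_values argument.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for the real file (subset of the sample data).
csv_text = """Time,Location,Event,Badge ID
18:28:59,Gate-2,Access Granted,81002
18:27:55,Gate-3,Access Granted,80557
18:27:44,Gate-3,NO Access,80398
18:25:38,Gate-1,NO Access,80978
"""

# Each sentinel is treated as missing only in the column it is listed under.
df = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"Location": ["Gate-3"], "Event": ["NO Access"]},
)
print(df)
```

With the dict form, a "Gate-3" that happened to appear in some other column would be left alone.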

Answer 1: (score: 2)

You can get a boolean mask of the rows where the condition is met:

mask = df.Location.eq('Gate-3') & df.Event.eq('NO Access') # df is your dataframe

You can use that mask to set NaN in whichever columns you need, like this:

df.loc[mask, ['Location', 'Event']] = np.nan # imported numpy as np

Edit:

It seems you have changed the spec. If you want to set NaN wherever the Location or the Event column matches your sentinel value, use two masks.

locmask = df.Location.eq('Gate-3')
df.loc[locmask, 'Location'] = np.nan
evmask = df.Event.eq('NO Access')
df.loc[evmask, 'Event'] = np.nan
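To make the difference between the two variants concrete, here is a small sketch. The four-row frame is made up for illustration, and the names both and out are mine. The combined mask is True only where both conditions hold on the same row, while the two separate masks blank each column independently.

```python
import numpy as np
import pandas as pd

# Made-up frame covering all combinations of match / no match.
df = pd.DataFrame({
    "Location": ["Gate-2", "Gate-3", "Gate-3", "Gate-1"],
    "Event": ["Access Granted", "Access Granted", "NO Access", "NO Access"],
})

# Combined (AND) mask: True only on the row where both columns match.
both = df["Location"].eq("Gate-3") & df["Event"].eq("NO Access")
print(both.tolist())

# Separate masks: each column is cleaned on its own.
out = df.copy()
out.loc[out["Location"].eq("Gate-3"), "Location"] = np.nan
out.loc[out["Event"].eq("NO Access"), "Event"] = np.nan
print(out)
```

Only the second variant reproduces the expected output in the question, where a row can have NaN in one column and a real value in the other.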

Answer 2: (score: 1)

If I haven't misunderstood your question, how about this?

import pandas as pd
import numpy as np
df.loc[df.Location == 'Gate-3', 'Location'] = np.nan
df.loc[df.Event == 'NO Access', 'Event'] = np.nan
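As an alternative that is not part of this answer, the same per-column substitution can probably be written with DataFrame.replace and a nested dict of the form {column: {old: new}}. A minimal sketch, assuming a small made-up frame; the name cleaned is mine:

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({
    "Location": ["Gate-2", "Gate-3", "Gate-1"],
    "Event": ["Access Granted", "NO Access", "NO Access"],
    "BadgeID": [81002, 80557, 80557],
})

# Outer keys are column names, inner dicts map old value -> new value.
# replace returns a new DataFrame and leaves df untouched.
cleaned = df.replace({
    "Location": {"Gate-3": np.nan},
    "Event": {"NO Access": np.nan},
})
print(cleaned)
```

Unlike the .loc assignments, this does not modify df in place, so the result has to be assigned to a name.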

Answer 3: (score: 0)

You don't need to iterate to set column values based on a condition. Instead, use boolean indexing.

Example:

import pandas as pd
import numpy as np

data = {'Time': ['18:28:59', '18:28:59', '18:28:59'],
        'Location': ['Gate-2', 'Gate-3', 'Gate-1'],
        'Event': ['Access Granted', 'NO Access', 'NO Access'],
        'BadgeID': [81002, 80557, 80557]}

df = pd.DataFrame(data)

       Time Location           Event  BadgeID
0  18:28:59   Gate-2  Access Granted    81002
1  18:28:59   Gate-3       NO Access    80557
2  18:28:59   Gate-1       NO Access    80557

The loc method is a label-based indexer that, among other options, also accepts boolean arrays.

The conditional expression:

df.Location == 'Gate-3'

returns a boolean Series:

0    False
1     True
2    False
Name: Location, dtype: bool

You can check this with the built-in function type():

type(df.Location == 'Gate-3')
# pandas.core.series.Series

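This Series is used as the row indexer for the loc method of the original DataFrame.

The loc method takes a row indexer and a column indexer, so the following statement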

df.loc[df.Location == 'Gate-3', 'Location'] = np.nan

translates to:

Set the intersection of the rows where Location is Gate-3 and the Location column to a null value.
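Putting the pieces of this answer together, here is a runnable sketch on the same three-row example. The name is_gate3 is mine. It shows the mask used first as a row indexer only, and then together with a column indexer for the assignment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time": ["18:28:59", "18:28:59", "18:28:59"],
    "Location": ["Gate-2", "Gate-3", "Gate-1"],
    "Event": ["Access Granted", "NO Access", "NO Access"],
    "BadgeID": [81002, 80557, 80557],
})

# Boolean Series: True on the rows whose Location equals 'Gate-3'.
is_gate3 = df["Location"] == "Gate-3"

# Row indexer only: selects every column of the matching rows.
print(df.loc[is_gate3])

# Row and column indexer: assigns NaN only at their intersection.
df.loc[is_gate3, "Location"] = np.nan
print(df["Location"].isna().tolist())
```

The Event column is untouched by this statement; cleaning it needs a second assignment with its own mask, df.Event == 'NO Access'.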