Question

I have a CSV file like the one below, but with about 155,000 rows covering 1910 to 2010 and 83 different station IDs:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       0TT    0     0   0   0   0 
216565       2008      8    SNOW        0     0T    0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0   0

I want to replace with NaN any value that matches the pattern of a digit followed by one letter, or a digit followed by two letters.

My desired output is:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       NaN    0     0   0   0   0 
216565       2008      8    SNOW        0     NaN   0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0   0

I tried using:

replace = df.replace([r'[0-9] [A-Z]'], ['NA'])
replace2 = replace.replace([r'[0-9][A-Z][A-Z]'], ['NA'])

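I expected the pattern [0-9][A-Z] to take care of a digit followed by one letter, and [0-9][A-Z][A-Z] to replace any cell with a digit followed by two letters, but the file stays exactly the same even though no error is returned.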

Any help would be greatly appreciated.

Answer 1

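You can do this with the pandas method convert_objects by setting convert_numeric to True:

convert_numeric: if True, attempt to coerce to numbers (including strings); values that cannot be converted become NaN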

>>> df
   station_id  year  month element    1   2  3  4  5  6
0      216565  2008      7    SNOW  0TT   0  0  0  0  0
1      216565  2008      8    SNOW    0  0T  0  0  0  0
2      216565  2008      9    SNOW    0   0  0  0  0  0
>>> df.convert_objects(convert_numeric=True)
   station_id  year  month element   1   2  3  4  5  6
0      216565  2008      7    SNOW NaN   0  0  0  0  0
1      216565  2008      8    SNOW   0 NaN  0  0  0  0
2      216565  2008      9    SNOW   0   0  0  0  0  0
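
Note that convert_objects was deprecated in later pandas releases and has since been removed. A minimal sketch of the same idea with pd.to_numeric and errors="coerce" follows. The DataFrame is a hypothetical, shortened copy of the question's data, and the day columns are assumed to be named "1", "2", "3" as in the sample:

```python
import pandas as pd

# Hypothetical reconstruction of (part of) the sample data from the question.
df = pd.DataFrame({
    "station_id": [216565, 216565, 216565],
    "year": [2008, 2008, 2008],
    "month": [7, 8, 9],
    "element": ["SNOW", "SNOW", "SNOW"],
    "1": ["0TT", "0", "0"],
    "2": ["0", "0T", "0"],
    "3": ["0", "0", "0"],
})

# Coerce only the day columns; anything that is not a valid number becomes NaN.
day_cols = ["1", "2", "3"]
cleaned = df.copy()
cleaned[day_cols] = cleaned[day_cols].apply(pd.to_numeric, errors="coerce")
print(cleaned)
```

Restricting the conversion to the day columns leaves text columns such as element untouched, which a blanket conversion of the whole frame would not.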

If you would rather go the replace route, you need to modify your call.

>>> df
   station_id  year  month element    1   2  3  4  5  6
0      216565  2008      7    SNOW  0TT   0  0  0  0  0
1      216565  2008      8    SNOW    0  0T  0  0  0  0
2      216565  2008      9    SNOW    0   0  0  0  0  0
>>> df.replace(value=np.nan, regex=r'[0-9][A-Z]+')
   station_id  year  month element    1    2  3  4  5  6
0      216565  2008      7    SNOW  NaN    0  0  0  0  0
1      216565  2008      8    SNOW    0  NaN  0  0  0  0
2      216565  2008      9    SNOW    0    0  0  0  0  0

This also requires you to import numpy (import numpy as np).
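
A likely reason the attempt in the question changed nothing is that, without regex=True, replace treats the given strings as literal cell values rather than as patterns. Below is a small sketch of the difference, using a hypothetical two-column frame and an anchored pattern so that only whole cells made of digits plus one or two letters match:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data (string-valued day columns).
df = pd.DataFrame({"1": ["0TT", "0", "0"], "2": ["0", "0T", "0"]})

# Without regex=True the pattern is compared literally, so nothing matches.
literal = df.replace([r"[0-9][A-Z]"], ["NA"])

# With regex=True the pattern is applied to each string cell.
flagged = df.replace(r"^[0-9]+[A-Z]{1,2}$", np.nan, regex=True)
print(flagged)
```

The remaining cells are still strings after this step, so a numeric conversion is needed afterwards if the columns are to be used in calculations.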

Answer 2

str.replace does not support regular expressions. Use the re module instead (assuming df is a string):

import re
re.sub(r'[0-9][A-Z]+', 'NaN', df)

Returns:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       NaN    0     0   0   0   0 
216565       2008      8    SNOW        0     NaN    0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0   0

However, you are better off letting a tool such as Pandas or np.genfromtxt handle the invalid values automatically.
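
As an illustration of that suggestion, here is a sketch with np.genfromtxt, assuming the file is whitespace-delimited as shown in the question. The inline text stands in for the real file, and the column positions are assumptions based on the sample. With the default float dtype, entries that cannot be parsed should come back as nan:

```python
import io
import numpy as np

# Stand-in for the real file; the layout follows the sample in the question.
text = """station_id year month element 1 2 3 4 5 6
216565 2008 7 SNOW 0TT 0 0 0 0 0
216565 2008 8 SNOW 0 0T 0 0 0 0
216565 2008 9 SNOW 0 0 0 0 0 0
"""

# Read only the six day columns (positions 4-9); unparseable entries become nan.
values = np.genfromtxt(io.StringIO(text), skip_header=1, usecols=range(4, 10))
print(values)
```

The text column is skipped through usecols because it would otherwise be read as nan as well.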

Answer 3

from re import sub

string = "station_id year month element 1 2 3 4 5 6 216565 2008 7 SNOW 0TT 0 0 0 0 0 216565 2008 8 SNOW 0 0T 0 0 0 0 216565 2008 9 SNOW 0 0 0 0 0 0"

string = sub(r'\d{1}[A-Za-z]{1,2}', 'NaN', string)

print(string)

# station_id year month element 1 2 3 4 5 6 216565 2008 7 SNOW NaN 0 0 0 0 0 216565 2008 8 SNOW 0 NaN 0 0 0 0 216565 2008 9 SNOW 0 0 0 0 0 0
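
One caveat with patterns like the ones above: they can match inside a longer token (for instance the tail of 10TT), leaving a stray digit behind. Below is a sketch with word boundaries, assuming flagged values always consist of digits followed by one or two letters. The sample row is hypothetical:

```python
import re

# Hypothetical row with a multi-digit flagged value (10TT).
row = "216565 2008 7 SNOW 10TT 0 0T 0 0 0"

# \b anchors the match to whole tokens; \d+ allows multi-digit values.
pattern = re.compile(r"\b\d+[A-Za-z]{1,2}\b")
cleaned = pattern.sub("NaN", row)
print(cleaned)
# 216565 2008 7 SNOW NaN 0 NaN 0 0 0
```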

Replacing values that match a specific pattern with 'NaN' in a CSV file
