I am new to the Python pandas module and am trying to use it for the simple purpose of validating that the "Height" field in a CSV file contains only positive integer values.
Name,Height
Name1,1234
Name2,1234.2
Name3,-1234
Name4,
Is there a way to identify all the invalid values (negative, float, string, blank) using pandas functions? I tried several options, but each one handles only one kind of invalid value and raises an exception for the others:
df['Height'].convert_objects(False,True,False,False).isnull()
df['Height'] != df['Height'].astype(numpy.int64)
pandas.read_csv('test.csv', dtype={'Height': int})
Any suggestions for capturing all the invalid combinations in a better way, or for another module suited to validating CSV file contents? I also tried csv and petl, where the header field type specification seems better controlled, but they are not as feature-rich as pandas.
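Answer 0 (score: 0)
I'm not sure what you want to do with the result, but here are a couple of options, assuming you have already loaded the file into a DataFrame with df = pd.read_csv(myfile).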
df['valid'] = np.where((df.Height >= 0) & (df.Height.replace('', 0.5).mod(1) == 0), True, False)
This adds a valid column:
    Name  Height  valid
0  Name1    1234   True
1  Name2  1234.2  False
2  Name3   -1234  False
3  Name4          False
Or you can filter out the invalid rows:
df = df[(df.Height >= 0) & (df.Height.replace('', 0.5).mod(1) == 0)]
which leaves you with:
Name Height
0 Name1 1234
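Either way, I use the same df.Height >= 0 to flag strings and negative values, and df.Height.replace('', 0.5).mod(1) == 0 to flag floats for removal. I added replace('', 0.5) to get around mod not liking strings; there may be a more elegant way.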
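As a rough sketch of the same check on the sample data from the question: with a default pd.read_csv call the blank cell is parsed as NaN rather than as an empty string, so the column comes back as float64 and the replace('', 0.5) step has nothing to do. The variable names csv_text and height below are illustrative only, and the sketch assumes the column holds no text; a column with text in it would first need converting, for example with pd.to_numeric(..., errors='coerce').

```python
from io import StringIO

import pandas as pd

# Sample data from the question; the last row has a blank Height.
csv_text = "Name,Height\nName1,1234\nName2,1234.2\nName3,-1234\nName4,\n"
df = pd.read_csv(StringIO(csv_text))

# The blank cell is parsed as NaN, so Height is a float64 column here.
height = df["Height"]

# NaN >= 0 is False, so the blank row fails the first test;
# mod(1) == 0 is True only for values with no fractional part.
df["valid"] = (height >= 0) & (height.mod(1) == 0)
print(df)
```

The boolean expression is already a True/False Series, so wrapping it in np.where(..., True, False) is optional.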
Answer 1 (score: 0)
You're almost there:
This catches empty or non-numeric values, but not floats or negative values:
df['Height'].convert_objects(False,True,False,False).isnull()
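But by converting the series to numeric, you no longer have to deal with non-numeric values, which is good.
By the way, convert_objects is now deprecated (it was removed entirely in pandas 1.0); to_numeric is recommended instead.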
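A minimal sketch of that conversion with pd.to_numeric and errors='coerce', which turns anything that cannot be parsed as a number into NaN. The heights series here is made-up input mirroring the question's data, with None standing in for the blank cell.

```python
import pandas as pd

# Illustrative input: strings as they might appear in the CSV, None for the blank.
heights = pd.Series(["1234", "1234.2", "-1234", None, "some string"])

# errors="coerce" replaces unparseable entries with NaN instead of raising.
numeric = pd.to_numeric(heights, errors="coerce")

# True where the original value was blank or not a number.
not_numeric = numeric.isnull()
print(not_numeric)
```

Negative numbers and floats still pass this test, so it only covers the blank and non-numeric cases.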
This catches floats, but raises an exception for empty and non-numeric values:
df['Height'] != df['Height'].astype(numpy.int64)
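Non-numeric values are no longer a problem if you use the purely numeric series from above (they have become NaN). To avoid the exception that astype raises on NaN, use Series.round() instead of Series.astype(numpy.int64).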
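The difference between the two can be seen in a small sketch on an already-numeric series (the values in x are illustrative). It assumes that casting NaN to an integer dtype raises a ValueError or a subclass of it, which is the behaviour I would expect from current pandas versions.

```python
import numpy as np
import pandas as pd

# A numeric series that still contains a missing value.
x = pd.Series([1234.0, 1234.2, -1234.0, np.nan])

# astype(int64) cannot represent NaN, so the cast fails.
try:
    x.astype(np.int64)
    astype_failed = False
except ValueError:
    astype_failed = True

# round() keeps NaN as NaN, so the comparison runs without error.
has_fraction = x != x.round()
print(astype_failed, has_fraction.tolist())
```

Note that NaN != NaN evaluates to True, so missing values are flagged by this comparison as well.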
Forcing the type during read_csv raises an exception for non-numeric values:
pandas.read_csv('test.csv', dtype={'Height': int})
You don't need this at this point.
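For illustration, here is a small sketch of that failure using an in-memory CSV instead of test.csv. The exact exception type and message may differ between pandas versions and parser engines, so the sketch catches both ValueError and TypeError as an assumption rather than relying on one of them.

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for test.csv, including a blank and a text value.
text = "Name,Height\nName1,1234\nName2,1234.2\nName4,\nName5,some string\n"

try:
    pd.read_csv(StringIO(text), dtype={"Height": int})
    raised = False
except (ValueError, TypeError) as exc:
    # The whole read fails, so no per-row information is available.
    raised = True
    print("read_csv failed:", exc)
```

Because the read aborts on the first problem, this tells you that something is wrong but not which rows are invalid.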
So let's put it all together:
from io import StringIO  # on Python 2, use: from StringIO import StringIO

import pandas as pd

def is_invalid(s):
    # Coerce anything non-numeric (blank cells, text) to NaN.
    x = pd.to_numeric(s, errors='coerce')
    # Invalid if missing or non-numeric, negative, or not a whole number.
    return x.isnull() | (x < 0) | (x != x.round())

text = '''Name,Height
Name1,1234
Name2,1234.2
Name3,-1234
Name4,
Name5,some string'''

df = pd.read_csv(StringIO(text))
print(df.assign(invalid=is_invalid(df['Height'])))
    Name       Height  invalid
0  Name1         1234    False
1  Name2       1234.2     True
2  Name3        -1234     True
3  Name4          NaN     True
4  Name5  some string     True
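As a possible follow-up, the same mask can be used to split the frame into rows to report and rows to keep. This is only a sketch: is_invalid is repeated from above so the snippet runs on its own, and the names mask, bad_rows and good_rows are illustrative.

```python
from io import StringIO

import pandas as pd


def is_invalid(s):
    # Same check as above: NaN after coercion, negative, or fractional.
    x = pd.to_numeric(s, errors="coerce")
    return x.isnull() | (x < 0) | (x != x.round())


text = """Name,Height
Name1,1234
Name2,1234.2
Name3,-1234
Name4,
Name5,some string"""

df = pd.read_csv(StringIO(text))
mask = is_invalid(df["Height"])

bad_rows = df[mask]    # rows to report back to whoever supplied the file
good_rows = df[~mask]  # rows that passed validation
print(bad_rows)
print(good_rows)
```

One caveat: the check uses x < 0, so a Height of 0 counts as valid. If only strictly positive values are acceptable, x <= 0 would be the comparison to use instead.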