Validating integer values in a CSV file using the pandas module

Date: 2016-02-03 02:56:38

Tags: pandas

I am new to the Python pandas module and am trying to use it for the simple purpose of validating that the "Height" field in a CSV file contains only positive integer values.

test.csv

Name,Height
Name1,1234
Name2,1234.2
Name3,-1234
Name4,

Is there a way to identify all the invalid values (negative, float, string, blank) using pandas functions? I tried several options, each of which works for one kind of invalid value but misses, or raises an exception for, the other kinds:

  • Catches empty or non-numeric values, but not float or negative values: df['Height'].convert_objects(False,True,False,False).isnull()
  • Catches float values, but raises an exception for empty and non-numeric values: df['Height'] != df['Height'].astype(numpy.int64)
  • Forcing the type during read_csv raises an exception for non-numeric values: pandas.read_csv('test.csv', dtype={'Height':int})

Any suggestions for capturing all the invalid combinations in a better way, or for another module that validates CSV file content? I also tried csv and petl, where the header field type specification seems better controlled, but they are not as feature-rich as pandas.

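2 answers:

Answer 0 (score: 0)

I'm not sure what you want to do with the result, but here are a couple of options, assuming you have already loaded the data into a dataframe with df = pd.read_csv(myfile).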

df['valid'] = np.where((df.Height >= 0) & (df.Height.replace('', 0.5).mod(1) == 0), True, False)

This adds a valid column:

    Name  Height  valid
0  Name1    1234   True
1  Name2  1234.2  False
2  Name3   -1234  False
3  Name4          False

Or you can filter out the invalid rows:

df = df[(df.Height >= 0) & (df.Height.replace('', 0.5).mod(1) == 0)]

Which leaves you with:

    Name Height
0  Name1   1234

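Either way, I use the same df.Height >= 0 to flag strings and negatives, and df.Height.replace('', 0.5).mod(1) == 0 to flag the floats to be removed. I did replace('', 0.5) to get around mod not liking strings; there may be a more elegant way.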
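A minimal sketch of the same mask, assuming Height was parsed by read_csv as a numeric (float64) column so that the blank cell is already NaN. Under that assumption the replace('', 0.5) step is not needed, because comparisons against NaN evaluate to False. The inline CSV text stands in for test.csv.

```python
from io import StringIO

import pandas as pd

text = """Name,Height
Name1,1234
Name2,1234.2
Name3,-1234
Name4,
"""

# With default settings the blank cell is read as NaN and Height is float64
df = pd.read_csv(StringIO(text))

# NaN >= 0 is False and NaN % 1 is NaN, so the blank row fails both checks
valid = (df["Height"] >= 0) & (df["Height"].mod(1) == 0)
print(valid.tolist())  # [True, False, False, False]
```

This does not cover a column that also contains arbitrary strings; in that case the column is read as object dtype and the comparison itself can fail.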

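Answer 1 (score: 0)

You're almost there:

  Catches empty or non-numeric values, but not float or negative values:
  df['Height'].convert_objects(False,True,False,False).isnull()

But by converting the series to numeric, you no longer have to deal with non-numeric values, which is good.

By the way, convert_objects is now deprecated; to_numeric is recommended instead.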
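As a rough illustration of that replacement, the sketch below uses pd.to_numeric with errors='coerce' on a small hand-built series (the sample values are made up to mirror the question's data). Anything that cannot be parsed as a number becomes NaN, so isnull() then flags both blank and non-numeric entries.

```python
import pandas as pd

# Sample values mirroring the question: int, float, negative, blank, text
heights = pd.Series(["1234", "1234.2", "-1234", float("nan"), "some string"])

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(heights, errors="coerce")
print(numeric.isnull().tolist())  # [False, False, False, True, True]
```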

  

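  Catches float values, but raises an exception for empty and non-numeric values:
  df['Height'] != df['Height'].astype(numpy.int64)

If you use the purely numeric series from above, non-numeric values are no longer a problem (they have become NaN). To avoid the exception that astype raises on NaN, use Series.round() instead of Series.astype(numpy.int64).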
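The difference between the two calls can be sketched on a small float series containing NaN (the values are illustrative). Casting to int64 fails on the NaN, while comparing against round() works for every element; note that NaN != NaN is True, so missing values are flagged as well.

```python
import pandas as pd

x = pd.Series([1234.0, 1234.2, float("nan")])

# astype(int64) cannot represent NaN and raises an error
try:
    x.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# round() keeps NaN as NaN, so the comparison never raises
has_fraction = x != x.round()
print(cast_failed)            # True
print(has_fraction.tolist())  # [False, True, True]
```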
  

  Forcing the type during read_csv raises an exception for non-numeric values:
  pandas.read_csv('test.csv', dtype={'Height':int})

You don't need this at this point.

So let's put it all together:

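from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd


def is_invalid(s):
    # Coerce to numbers; blanks and non-numeric strings become NaN
    x = pd.to_numeric(s, errors='coerce')
    # Invalid if missing/non-numeric, negative, or not a whole number
    return (x.isnull()) | (x < 0) | (x != x.round())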


text = '''Name,Height
Name1,1234
Name2,1234.2
Name3,-1234
Name4,
Name5,some string'''

df = pd.read_csv(StringIO(text))
print(df.assign(invalid=is_invalid(df['Height'])))

    Name      Height  invalid
0  Name1         1234   False
1  Name2       1234.2    True
2  Name3        -1234    True
3  Name4          NaN    True
4  Name5  some string    True
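One possible way to use the resulting mask is to split the frame into valid and invalid rows. The sketch below repeats is_invalid so that it runs on its own; the names mask, valid_rows and invalid_rows are illustrative and not part of the answer above.

```python
from io import StringIO

import pandas as pd


def is_invalid(s):
    # Same check as above: NaN after coercion, negative, or fractional
    x = pd.to_numeric(s, errors="coerce")
    return x.isnull() | (x < 0) | (x != x.round())


text = """Name,Height
Name1,1234
Name2,1234.2
Name3,-1234
Name4,
Name5,some string"""

df = pd.read_csv(StringIO(text))
mask = is_invalid(df["Height"])

valid_rows = df[~mask]   # rows whose Height is a non-negative whole number
invalid_rows = df[mask]  # everything else, e.g. for reporting

print(valid_rows["Name"].tolist())    # ['Name1']
print(invalid_rows["Name"].tolist())  # ['Name2', 'Name3', 'Name4', 'Name5']
```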