Question

I have a pandas DataFrame with 30 columns and 4000 rows.

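For about 5 of the columns, I need to check that the values conform to validation rules. Is there a way to say something like "if df.Gender contains any value that's not 'M' or 'F' then print error"?

Or "if df.MaritalStatus contains a value that's not M, S, or D then print error"?

Does anyone know the best way to apply conditions like these?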

import pandas as pd

df = pd.read_csv("C:/Users/ABV1234/Desktop/DailyReport.csv")

## if df.Gender contains a value that is not in ['m', 'f'], print an error

Answer 1

You can check whether every value in df.Gender is in ['M', 'F']:

if not df.Gender.isin(['M', 'F']).all():
    print("Error")
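
It may help to see why the check has to cover every row, rather than only confirm that 'M' or 'F' appears somewhere in the column. A small sketch with made-up data (the Series below is hypothetical, not from the question's CSV):

```python
import pandas as pd

# Made-up sample: 'X' is the invalid value we want to catch.
gender = pd.Series(["M", "F", "X"])

# True as soon as one expected value occurs somewhere in the column,
# so the invalid 'X' goes unnoticed.
some_expected_present = any(x in gender.values for x in ["M", "F"])

# True only when every row holds one of the expected values.
all_rows_valid = gender.isin(["M", "F"]).all()

print(some_expected_present)  # True
print(all_rows_valid)         # False
```

Only the second expression answers the question "does the column contain anything other than 'M' or 'F'".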

Answer 2

Checking the first condition:

if df.Gender contains any value that's not 'M' or 'F' then print error

gender_series = df.Gender.values

for x in gender_series:
    if x not in ('M', 'F'):
        print("error")

Checking the second condition:

if df.MaritalStatus contains a value that's not M, S, D then print error.

maritalstatus_series = df.MaritalStatus.values

for x in maritalstatus_series:
    if x not in ('M', 'S', 'D'):
        print("error")
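
The loops above print "error" once per offending row without saying which row it was. A vectorized sketch that reports the offending row labels in a single message might look like the following; the helper name `invalid_rows` and the sample data are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV from the question.
df = pd.DataFrame({
    "Gender": ["M", "F", "X", "M"],
    "MaritalStatus": ["M", "S", "D", "W"],
})

def invalid_rows(frame, column, allowed):
    """Return the index labels of rows whose value in `column` is not in `allowed`."""
    mask = ~frame[column].isin(allowed)
    return frame.index[mask].tolist()

bad_gender = invalid_rows(df, "Gender", ["M", "F"])
bad_marital = invalid_rows(df, "MaritalStatus", ["M", "S", "D"])

# One message per column instead of one per bad row.
if bad_gender:
    print(f"error: invalid Gender in rows {bad_gender}")
if bad_marital:
    print(f"error: invalid MaritalStatus in rows {bad_marital}")
```

Note that missing values (NaN) are not in the allowed lists, so they would be reported as invalid too.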

Thanks

Answer 3

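One possible improvement on the answers above is to evaluate the whole column first, then collect and report all of the failure cases.

This returns a filtered DataFrame containing every row where the Gender column is neither 'M' nor 'F'.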

import pandas as pd
df = pd.DataFrame({"MaritalStatus":["M","S","F"],"Gender":["M","S","F"]})
df.loc[~df.loc[:,"Gender"].isin(['M','F']),:] 
>>>  MaritalStatus Gender
    1             S      S

The same works for marital status:

df.loc[~df.loc[:,"MaritalStatus"].isin(['M','S','D']),:]
>>>  MaritalStatus Gender
    2             F      F

If you want to spot-check the data for unexpected values, you can collect the values that fail these conditions:

expected_values = {"MaritalStatus": ['M','S','D'], "Gender": ['M','F']}
for feature in expected_values:
    print(f"The following unexpected values were found in {feature} column:",
          set(df.loc[~df[feature].isin(expected_values[feature]), feature]))
>>> The following unexpected values were found in MaritalStatus column: {'F'}
>>> The following unexpected values were found in Gender column: {'S'}
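
If the result needs to be acted on rather than only printed, the same idea can be wrapped in a small function that returns a report. This is a sketch under assumptions: the function name `unexpected_values` and the dict-shaped report are illustrative choices, and the sample frame repeats the one used in this answer.

```python
import pandas as pd

def unexpected_values(frame, expected):
    """Map each checked column to the distinct values not in its allowed list."""
    report = {}
    for column, allowed in expected.items():
        bad = frame.loc[~frame[column].isin(allowed), column]
        if not bad.empty:
            report[column] = bad.unique().tolist()
    return report

df = pd.DataFrame({"MaritalStatus": ["M", "S", "F"], "Gender": ["M", "S", "F"]})
expected_values = {"MaritalStatus": ["M", "S", "D"], "Gender": ["M", "F"]}

report = unexpected_values(df, expected_values)
for column, values in report.items():
    print(f"error: unexpected values in {column}: {values}")
```

An empty report means every checked column passed, so the caller can decide whether to print, log, or raise.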

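Alternatively, you can use the pandera library, which lets you declare expectations about a dataset and validate the data against them. Validating lazily (lazy=True) collects all failure cases so you can see them at once, instead of stopping at the first failure.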

import pandera as pa

schema = pa.DataFrameSchema(
    {
        "MaritalStatus": pa.Column(pa.String, checks=pa.Check.isin(["M","S","D"])),
        "Gender": pa.Column(pa.String, checks=pa.Check.isin(["M","F"])),
    },
    strict=False,
)
schema.validate(df, lazy=True)

>>> 
  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Daten\venv\lib\site-packages\pandera\schemas.py", line 592, in validate
    error_handler.collected_errors, check_obj
pandera.errors.SchemaErrors: A total of 2 schema errors were found.

Error Counts
------------
- schema_component_check: 2

Schema Error Summary
--------------------
                                                   failure_cases  n_failure_cases
schema_context column        check
Column         Gender        isin({'F', 'M'})                [S]                1
               MaritalStatus isin({'M', 'D', 'S'})           [F]                1

Pandas DataFrame column validation
