I have a DataFrame df:
   plan_year                          name metal_level_name
0      20118  Gold Heritage Plus 1500 - 02             Gold
1       2018                           NaN         Platinum
2       2018  Gold Heritage Plus 2000 - 01             Gold
I ran the following validation on the plan_year and name columns:
m4 = ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4))
m1 = (df[['name']].notnull().all(axis=1))
This gives me the valid DataFrame:
df1 = df[m1 & m4]
I can get the rows that are not present in df1 (the invalid rows):
merged = df.merge(df1.drop_duplicates(), how='outer', indicator=True)
merged[merged['_merge'] == 'left_only']
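For reference, here is a minimal, self-contained sketch of the setup above. The sample data is reconstructed from the table (the dtypes are an assumption), and the names year_as_text, valid and invalid are only illustrative. It also shows that the invalid rows can be selected by negating the combined mask, which may be simpler than the outer merge with indicator=True.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the table above (dtypes are assumed).
df = pd.DataFrame({
    "plan_year": [20118, 2018, 2018],
    "name": ["Gold Heritage Plus 1500 - 02", np.nan, "Gold Heritage Plus 2000 - 01"],
    "metal_level_name": ["Gold", "Platinum", "Gold"],
})

# plan_year must be present, all digits, and exactly four characters long.
year_as_text = df["plan_year"].astype(str)
m4 = df["plan_year"].notnull() & year_as_text.str.isdigit() & (year_as_text.str.len() == 4)

# name must not be null.
m1 = df[["name"]].notnull().all(axis=1)

valid = df[m1 & m4]        # rows passing both checks
invalid = df[~(m1 & m4)]   # rows failing at least one check
print(invalid)
```

Negating the mask keeps the original index and does not depend on the rows being unique, whereas the merge-based approach compares whole rows.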
I want to track which validation check caused each row to fail.
I would like a DataFrame containing all the invalid rows, like this:
   plan_year                          name metal_level_name                 Failed message
0      20118  Gold Heritage Plus 1500 - 02             Gold  Failed due to wrong plan_year
1       2018                           NaN         Platinum     name column cannot be null
Can anyone help me?
Answer 0 (score: 2)
You can use numpy.select together with ~ to invert the boolean masks:
import numpy as np
import pandas as pd

message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
# ~m1 / ~m4 are True where the corresponding check failed
df['Failed message'] = np.select([~m1, ~m4], [message1, message4], default='OK')
print(df)
   plan_year                          name metal_level_name  \
0      20118  Gold Heritage Plus 1500 - 02             Gold
1       2018                           NaN         Platinum
2       2018  Gold Heritage Plus 2000 - 01             Gold

                  Failed message
0  Failed due to wrong plan_year
1     name column cannot be null
2                             OK
df1 = df[df['Failed message'] != 'OK']
print (df1)
   plan_year                          name metal_level_name  \
0      20118  Gold Heritage Plus 1500 - 02             Gold
1       2018                           NaN         Platinum

                  Failed message
0  Failed due to wrong plan_year
1     name column cannot be null
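One thing to keep in mind: numpy.select returns the choice for the first condition that is true, so a row that fails both checks is reported with only the first message in the list. The sketch below illustrates this on sample data that is an assumption (the table from the question plus one extra row that fails both checks).

```python
import numpy as np
import pandas as pd

# Assumed sample data: the last row fails both the name and the plan_year check.
df = pd.DataFrame({
    "plan_year": [20118, 2018, 2018, 20148],
    "name": ["Gold Heritage Plus 1500 - 02", None, "Gold Heritage Plus 2000 - 01", None],
    "metal_level_name": ["Gold", "Platinum", "Gold", "Platinum"],
})

year_as_text = df["plan_year"].astype(str)
m4 = df["plan_year"].notnull() & year_as_text.str.isdigit() & (year_as_text.str.len() == 4)
m1 = df[["name"]].notnull().all(axis=1)

message1 = "name column cannot be null"
message4 = "Failed due to wrong plan_year"

# Conditions are evaluated in order; the first True condition wins for each row.
df["Failed message"] = np.select([~m1, ~m4], [message1, message4], default="OK")
print(df["Failed message"].tolist())
```

Reordering the condition list changes which message a doubly invalid row receives, so the order acts as a priority.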
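EDIT: For multiple error messages, use concat to create a new DataFrame of the inverted masks, then matrix-multiply it by its column names with dot, and finally strip the trailing separator with str.rstrip: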
print (df)
   plan_year                          name metal_level_name
0      20118  Gold Heritage Plus 1500 - 02             Gold
1       2018                           NaN         Platinum
2       2018  Gold Heritage Plus 2000 - 01             Gold
1      20148                           NaN         Platinum
message1 = 'name column cannot be null'
message4 = 'Failed due to wrong plan_year'
df1 = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])
print (df1)
   name column cannot be null  Failed due to wrong plan_year
0                       False                           True
1                        True                          False
2                       False                          False
1                        True                           True
df['Failed message'] = df1.dot(df1.columns + ', ').str.rstrip(', ')
print (df)
   plan_year                          name metal_level_name  \
0      20118  Gold Heritage Plus 1500 - 02             Gold
1       2018                           NaN         Platinum
2       2018  Gold Heritage Plus 2000 - 01             Gold
1      20148                           NaN         Platinum

                                      Failed message
0                      Failed due to wrong plan_year
1                         name column cannot be null
2
1  name column cannot be null, Failed due to wron...
df1 = df[df['Failed message'] != '']
print (df1)
   plan_year                          name metal_level_name  \
0      20118  Gold Heritage Plus 1500 - 02             Gold
1       2018                           NaN         Platinum
1      20148                           NaN         Platinum

                                      Failed message
0                      Failed due to wrong plan_year
1                         name column cannot be null
1  name column cannot be null, Failed due to wron...
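As an alternative to the dot trick (this is a different technique, not the one used above), the combined message can also be built with a plain Python join over the failed checks of each row. The sketch below is self-contained; the sample data, the default integer index, and the names flags and invalid are assumptions made for illustration.

```python
import pandas as pd

# Assumed sample data with a default RangeIndex; the last row fails both checks.
df = pd.DataFrame({
    "plan_year": [20118, 2018, 2018, 20148],
    "name": ["Gold Heritage Plus 1500 - 02", None, "Gold Heritage Plus 2000 - 01", None],
    "metal_level_name": ["Gold", "Platinum", "Gold", "Platinum"],
})

year_as_text = df["plan_year"].astype(str)
m4 = df["plan_year"].notnull() & year_as_text.str.isdigit() & (year_as_text.str.len() == 4)
m1 = df[["name"]].notnull().all(axis=1)

message1 = "name column cannot be null"
message4 = "Failed due to wrong plan_year"

# One boolean column per check, named after its message; True means the check failed.
flags = pd.concat([~m1, ~m4], axis=1, keys=[message1, message4])

# For each row, join the names of the columns whose flag is True.
df["Failed message"] = [", ".join(flags.columns[row]) for row in flags.to_numpy()]

invalid = df[df["Failed message"] != ""]
print(invalid)
```

This avoids relying on how boolean values multiply with strings, at the cost of a Python-level loop, which should be acceptable for modest row counts.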