我有3个文件正在读取到数据帧(https://pastebin.com/v7BnSH3s)
map_df :将data_file
标头映射到codes_df
标头
Field Name Code Name
Gender gender_codes
Race race_codes
Ethnicity ethnicity_codes
code_df:有效代码
gender_codes race_codes ethnicity_codes
1 1 1
2 2 2
3 3 3
4 4 4
NaN NaN 5
NaN NaN 6
NaN NaN 7
data_df :需要根据代码检查的实际数据
Name Gender Race Ethnicity
Alex 99 1 7
Cindy 2 4 5
Tom 1 99 1
问题:
我需要确认data_df
每列中的每个值都是有效的代码。如果没有,我需要将Name
,无效值和列标题标签写为新列。因此,我的示例data_df
将为gender_codes
检查产生以下数据帧:
result_df:
Name Value Column
Alex 99 Gender
背景:
data_df
中的多个列。 map_df
,只是不知道哪些列映射到
哪些代码。但是,如果我可以将其合并到脚本中,那将是
理想。 我尝试过的事情:
我目前正在将每个代码列发送到一个列表,删除nan
字符串,使用loc
和isin
执行查找,然后设置result_df
... >
# code column to list
gender_codes = codes_df["gender_codes"].tolist()
# remove nan string
gender_codes = [gender_codes
for gender_codes in gender_codes
if str(gender_codes) != "nan"]
# check each value against code list
result_df = data_df.loc[(~data_df.Gender.isin(gender_codes))]
result_df = result_df.filter(items = ["Name","Gender"])
result_df.rename(columns = {"Gender":"Value"}, inplace = True)
result_df['Column'] = 'Gender'
这有效,但显然是非常原始的,不会随我的数据集扩展。我希望找到一种迭代和Pythonic的方法来解决这个问题。
编辑: 使用np.nan修改数据集
答案 0 :(得分:2)
我会将您的数据重新格式化为不同的格式
m = dict(map_df.itertuples(index=False))
c = code_df.T.stack().groupby(level=0).apply(set)
ddf = data_df.melt('Name', var_name='Column', value_name='Value')
ddf[[val not in c[col] for val, col in zip(ddf.Value, ddf.Column.map(m))]]
Name Column Value
0 Alex Gender 99
5 Tom Race 99
m # Just a dictionary with the same content as `map_df`
{'Gender': 'gender_codes',
'Race': 'race_codes',
'Ethnicity': 'ethnicity_codes'}
c # Series of membership sets
ethnicity_codes {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0}
gender_codes {1.0, 2.0, 3.0, 4.0}
race_codes {1.0, 2.0, 3.0, 4.0}
dtype: object
ddf # Melted dataframe to help match the final output
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
答案 1 :(得分:1)
您将需要预处理数据框并定义验证功能。如下所示:
1。预处理
# call melt() to convert columns to rows
mcodes = codes_df.melt(
value_vars=list(codes_df.columns),
var_name='Code Name',
value_name='Valid Code').dropna()
mdata = data_df.melt(
id_vars='Name',
value_vars=list(data_df.columns[1:]),
var_name='Column',
value_name='Value')
validation_df = mcodes.merge(map_df, on='Code Name')
Out:
mcodes:
Code Name Valid Code
0 gender_codes 1
1 gender_codes 2
7 race_codes 1
8 race_codes 2
9 race_codes 3
10 race_codes 4
14 ethnicity_codes 1
15 ethnicity_codes 2
16 ethnicity_codes 3
17 ethnicity_codes 4
18 ethnicity_codes 5
19 ethnicity_codes 6
20 ethnicity_codes 7
mdata:
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
validation_df:
Code Name Valid Code Field Name
0 gender_codes 1 Gender
1 gender_codes 2 Gender
2 race_codes 1 Race
3 race_codes 2 Race
4 race_codes 3 Race
5 race_codes 4 Race
6 ethnicity_codes 1 Ethnicity
7 ethnicity_codes 2 Ethnicity
8 ethnicity_codes 3 Ethnicity
9 ethnicity_codes 4 Ethnicity
10 ethnicity_codes 5 Ethnicity
11 ethnicity_codes 6 Ethnicity
12 ethnicity_codes 7 Ethnicity
2。验证功能
def isValid(row):
valid_list = validation_df[validation_df['Field Name'] == row.Column]['Valid Code'].tolist()
return row.Value in valid_list
3。验证
mdata['isValid'] = mdata.apply(isValid, axis=1)
result = mdata[mdata.isValid == False]
Out:
result:
Name Column Value isValid
0 Alex Gender 99 False
5 Tom Race 99 False
答案 2 :(得分:1)
m, df1 = dict(map_df.values), data_df.set_index('Name')
df1[df1.apply(lambda x:~x.isin(code_df[m[x.name]]))].stack().reset_index()
Out:
Name level_1 0
0 Alex Gender 99.0
1 Tom Race 99.0