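I am trying to create a dictionary in which each key's value is itself a dictionary with two entries.
I have two lists of patient barcodes (normal tissue and disease tissue) that correspond to value columns in a DataFrame. My goal is to match patients across the two lists and then, for each patient present in both, add that patient's normal and disease tissue values to a dictionary. The dictionary keys will be the patient barcodes, and each value will be another dictionary of the form normal tissue: values pulled from the DataFrame, disease tissue: values pulled from the DataFrame. So, starting from:
In [3]: df = pd.DataFrame({'Patient1_Normal':['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan'],
                           'Patient1_Disease':[0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
                           'Patient2_Disease':['nan', 'nan', 'nan', 1.0, 0.24, 0.67, 0.97, 0.98],
                           'Patient3_Normal': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9],
                           'Patient3_Disease':[0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
                           'Patient4_Normal':['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91],
                           'Patient4_Disease':['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
                           'Patient5_Disease': [0.34, 0.27, 'nan', 0.16, 0.32, 0.27, 0.55, 0.51]})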
In [4]: df
Out[4]:
   Patient1_Normal  Patient1_Disease  Patient2_Disease  Patient3_Normal  Patient3_Disease  Patient4_Normal  Patient4_Disease  Patient5_Disease
0              nan              0.12               nan             0.21              0.11              nan               nan              0.34
1             0.01              0.06               nan             0.25              0.45             0.35               nan              0.27
2              0.1              0.19               nan             0.63               nan              nan              0.56               nan
3             0.16              0.34                 1             0.92              0.45             0.22              0.72              0.16
4             0.88               nan              0.24             0.30              0.22             0.45               nan              0.32
5             0.83               nan              0.67             0.56              0.89             0.66              0.97              0.27
6             0.82              0.73              0.97             0.78              0.17             0.21              0.91              0.55
7              nan              0.91              0.98             0.90              0.12             0.91              0.79              0.51
This is what I have so far:
D_col = [col for col in df if '_Disease' in col]
N_col = [col for col in df if '_Normal' in col]
paired_patients = {}
psi_sets = {}
psi_sets['d'] = []
psi_sets['n'] = []
for patient in N_col:
    patient_id = patient[0:8]
    n_id = patient
    d_id = [i for i in D_col if patient_id in i]
    if len(d_id) > 0:
        psi_sets['n'] = df[n_id].to_list()
        for d in d_id:
            psi_sets['d'] = df[d].to_list()
        paired_patients[patient_id] = psi_sets
However, the values in my paired_patients dictionary are being overwritten rather than appended, so the output of paired_patients looks like this:
{'Patient1': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]},
'Patient3': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
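How can I fix the last part of the code so that the paired_patients dictionary values are added correctly for each patient, and the paired_patients dictionary looks like this: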
{'Patient1': {'d': [0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
'n': ['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan']},
'Patient3': {'d': [0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
'n': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
Answer 0 (score: 1)
D_col = [col for col in df if '_Disease' in col]
N_col = [col for col in df if '_Normal' in col]
paired_patients = {}
for patient in N_col:
    # Create a fresh dictionary for each patient so entries are not shared
    psi_sets = {}
    patient_id = patient[0:8]
    n_id = patient
    d_id = [i for i in D_col if patient_id in i]
    if len(d_id) > 0:
        psi_sets['n'] = df[n_id].to_list()
        for d in d_id:
            psi_sets['d'] = df[d].to_list()
        paired_patients[patient_id] = psi_sets
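The reason this helps, as far as I can tell: in the question's code psi_sets is created once before the loop, so every key in paired_patients ends up referring to the same dictionary object, and each iteration overwrites its 'n' and 'd' entries. Building a new dictionary inside the loop gives each patient its own object. Here is a minimal sketch of the difference, using plain lists in place of the DataFrame columns. The two-row values and the helper names pair_shared and pair_fresh are made up for illustration.

```python
# Hypothetical stand-in for the DataFrame: column name -> list of values.
columns = {
    "Patient1_Normal": [0.01, 0.10],
    "Patient1_Disease": [0.12, 0.06],
    "Patient3_Normal": [0.21, 0.25],
    "Patient3_Disease": [0.11, 0.45],
}


def pair_shared(columns):
    """Mimics the question: one psi_sets dict is reused for every patient."""
    paired = {}
    psi_sets = {}  # created once, outside the loop
    for name, values in columns.items():
        patient_id, tissue = name.split("_")
        disease = patient_id + "_Disease"
        if tissue == "Normal" and disease in columns:
            psi_sets["n"] = values
            psi_sets["d"] = columns[disease]
            paired[patient_id] = psi_sets  # every key references the same dict
    return paired


def pair_fresh(columns):
    """Mimics the fix: a new dict is built for every patient."""
    paired = {}
    for name, values in columns.items():
        patient_id, tissue = name.split("_")
        disease = patient_id + "_Disease"
        if tissue == "Normal" and disease in columns:
            paired[patient_id] = {"n": values, "d": columns[disease]}
    return paired


shared = pair_shared(columns)
fresh = pair_fresh(columns)

print(shared["Patient1"] is shared["Patient3"])  # True: one object, two keys
print(fresh["Patient1"] is fresh["Patient3"])    # False: separate objects
```

In the shared version, both keys show the values of whichever patient was processed last, which matches the symptom described in the question.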
Answer 1 (score: 0)
You can use df.melt, pd.concat, Series.str.split, df.replace, df.groupby and df.xs, and finally df.to_dict.
Check the following:
>>> df2 = (pd.concat([
...            df.melt().variable.str.split('_', expand=True),
...            df.melt().drop(columns='variable')
...        ], axis=1)
...        .replace({'Normal':'n', 'Disease':'d'})
...        .groupby([0,1]).agg(list))
>>> paired_patients = {k: v for k, v in
...                    df2.groupby(level=0)
...                       .apply(lambda df: df.xs(df.name).value.to_dict())
...                       .to_dict().items()
...                    if not ({'d', 'n'} ^ v.keys())}
>>> paired_patients
{'Patient1': {'d': [0.12, 0.06, 0.19, 0.34, 'nan', 'nan', 0.73, 0.91],
'n': ['nan', 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, 'nan']},
'Patient3': {'d': [0.11, 0.45, 'nan', 0.45, 0.22, 0.89, 0.17, 0.12],
'n': [0.21, 0.25,0.63,0.92,0.3, 0.56, 0.78, 0.9]},
'Patient4': {'d': ['nan', 'nan', 0.56, 0.72, 'nan', 0.97, 0.91, 0.79],
'n': ['nan', 0.35, 'nan', 0.22, 0.45, 0.66, 0.21, 0.91]}}
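The idea behind the pandas chain is to split each column name into a patient ID and a tissue label, group the values under that pair, and keep only the patients that have both labels. As a rough illustration of that logic (not the pandas code itself), the same steps can be written with plain dictionaries. The two-row columns mapping and the function name pair_patients below are made up for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for the DataFrame: column name -> list of values.
columns = {
    "Patient1_Normal": [0.01, 0.10],
    "Patient1_Disease": [0.12, 0.06],
    "Patient2_Disease": [1.0, 0.24],
    "Patient3_Normal": [0.21, 0.25],
    "Patient3_Disease": [0.11, 0.45],
}

# Same renaming as .replace({'Normal': 'n', 'Disease': 'd'}) in the answer.
LABELS = {"Normal": "n", "Disease": "d"}


def pair_patients(columns):
    """Group values by (patient, tissue) and keep patients with both tissues."""
    grouped = defaultdict(dict)
    for name, values in columns.items():
        patient_id, tissue = name.split("_")  # e.g. 'Patient1', 'Normal'
        grouped[patient_id][LABELS[tissue]] = list(values)
    # Keep only patients whose inner dict has exactly the keys 'n' and 'd'.
    return {pid: sets for pid, sets in grouped.items()
            if sets.keys() == {"n", "d"}}


paired_patients = pair_patients(columns)
print(sorted(paired_patients))  # ['Patient1', 'Patient3']
```

Patient2 is dropped here because it only has a disease column, which mirrors the filter at the end of the pandas version.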
EXPLANATION:
>>> df.melt()
            variable  value
0    Patient1_Normal    NaN
1    Patient1_Normal   0.01
2    Patient1_Normal   0.10
..               ...    ...
62  Patient5_Disease   0.55
63  Patient5_Disease   0.51
>>> df.melt().variable.str.split('_', expand=True)
           0        1
0   Patient1   Normal
1   Patient1   Normal
2   Patient1   Normal
..       ...      ...
62  Patient5  Disease
63  Patient5  Disease
[64 rows x 2 columns]
# then concat these two, replace 'Normal' and 'Disease' with 'n' and 'd' and drop
# the 'variable' column
>>> pd.concat([
...     df.melt().variable.str.split('_', expand=True),
...     df.melt().drop(columns='variable')
... ], axis=1).replace({'Normal':'n', 'Disease':'d'})
           0  1  value
0   Patient1  n    NaN
1   Patient1  n   0.01
2   Patient1  n   0.10
..       ... ..    ...
62  Patient5  d   0.55
63  Patient5  d   0.51
[64 rows x 3 columns]
# then group by columns [0, 1] and aggregate into lists:
>>> df2 = _.groupby([0,1]).agg(list)
>>> df2
            value
0        1
Patient1 d  [0.12, 0.06, 0.19, 0.34, nan, nan, 0.73, 0.91]
         n  [nan, 0.01, 0.1, 0.16, 0.88, 0.83, 0.82, nan]
Patient2 d  [nan, nan, nan, 1.0, 0.24, 0.67, 0.97, 0.98]
Patient3 d  [0.11, 0.45, nan, 0.45, 0.22, 0.89, 0.17, 0.12]
         n  [0.21, 0.25, 0.63, 0.92, 0.3, 0.56, 0.78, 0.9]
Patient4 d  [nan, nan, 0.56, 0.72, nan, 0.97, 0.91, 0.79]
         n  [nan, 0.35, nan, 0.22, 0.45, 0.66, 0.21, 0.91]
Patient5 d  [0.34, 0.27, nan, 0.16, 0.32, 0.27, 0.55, 0.51]
# Now group by level=0, convert that into a dict, and finally check whether
# both 'n' and 'd' are present as keys. The membership test below is
# equivalent here to the symmetric set difference on the dict_keys object
# used in the first snippet.
>>> paired_patients = {k: v for k, v in
...                    df2.groupby(level=0)
...                       .apply(lambda df: df.xs(df.name).value.to_dict())
...                       .to_dict().items()
...                    if ('n' in v) and ('d' in v)}
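Both filters keep a patient only when its inner dictionary contains the 'n' and 'd' keys. Below is a small sketch of how the two checks behave on a complete and an incomplete entry. The sample dictionaries and the helper names has_both_xor and has_both_in are invented for illustration. It relies on dict_keys objects supporting set operations such as ^.

```python
# Hypothetical inner dictionaries, shaped like the values built above.
complete = {"d": [0.12, 0.06], "n": [0.01, 0.10]}
disease_only = {"d": [1.0, 0.24]}


def has_both_xor(v):
    """True when the keys are exactly {'d', 'n'} (empty symmetric difference)."""
    return not ({"d", "n"} ^ v.keys())


def has_both_in(v):
    """True when both 'n' and 'd' are present (extra keys are tolerated)."""
    return ("n" in v) and ("d" in v)


print({"d", "n"} ^ complete.keys())      # set()
print({"d", "n"} ^ disease_only.keys())  # {'n'}
print(has_both_xor(complete), has_both_in(complete))          # True True
print(has_both_xor(disease_only), has_both_in(disease_only))  # False False
```

The two checks only differ if an inner dictionary carries keys other than 'n' and 'd', which does not happen with the column names in this question.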