I have the following DataFrame:
Activity SMILES
0 1.0 CCN1CCCC1CNC(=O)c1cc([N+](=O)[O-])cc(O)c1OC
1 1.0 O=c1cc(-c2cccs2)oc2ccc(OCCCCCCN3CCC(O)CC3)cc12
2 1.0 CCCCCCCCCC(=O)N1c2ccc(Cl)cc2N=C(N2CCN(C)CC2)c2...
3 1.0 CCN1C(=O)c2ccccc2S(=O)c2ccc(C(=O)NCc3ccc4c(c3)...
4 1.0 CCN1CCc2cc(OCCF)cc3c2C1Cc1cccc(O)c1-3
... ...
and I would like to obtain the following output:
Activity SMILES cluster cluster set
0 1.0 CCN1CCCC1CNC(=O)c1cc([N+](=O)[O-])cc(O)c1OC 0.0 val
1 1.0 O=c1cc(-c2cccs2)oc2ccc(OCCCCCCN3CCC(O)CC3)cc12 898.0 test
2 1.0 CCCCCCCCCC(=O)N1c2ccc(Cl)cc2N=C(N2CCN(C)CC2)c2... 7.0 val
3 1.0 CCN1C(=O)c2ccccc2S(=O)c2ccc(C(=O)NCc3ccc4c(c3)... 4.0 train
5 1.0 FC(F)(F)c1cccc(N2CCN(Cc3cn(-c4ccccc4)c(-c4cccc... 856.0 val
... ... ... ...
I have three lists of tuples (train_points, test_points and val_points), each of which looks like this:
[(4633, 0),
(3935, 3907),
(1410, 1409),
(1120, 1121, 3, 3771),
...]
I tried to implement the following nested loops:
# Remove irrelevant information from the DataFrame
# (.copy() so the assignments below do not target a slice of df)
df_triplets = df[['Activity', 'SMILES']].copy()
# Add clustering information
list_points = [train_points, test_points, val_points]
name_points = ['train', 'test', 'val']
# this loop should be working but it doesn't work
for name, points in zip(name_points, list_points):
    # num is the position of the cluster within its own set
    for num, cluster in enumerate(points):
        for molecule in cluster:
            df_triplets.loc[molecule, 'cluster'] = num
            df_triplets.loc[molecule, 'cluster set'] = name
However, this only fills df_triplets with the last cluster set (val_points) rather than all three sets I am interested in. Note that each "molecule" is a unique number.
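
For reference, here is a minimal sketch of the result I am after, written without repeated `.loc` assignments. It assumes that each number in the tuples is a row label of the DataFrame and that no molecule appears in more than one cluster. The toy data and the helper name `label_clusters` are made up for illustration and are not part of my real code. As in my loop, cluster numbers restart at 0 within each set.

```python
import pandas as pd

# Toy stand-ins for the real data (hypothetical values).
df = pd.DataFrame({
    "Activity": [1.0] * 6,
    "SMILES": ["C", "CC", "CCC", "CCO", "CCN", "CCCl"],
})
train_points = [(0, 3), (5,)]
test_points = [(1,)]
val_points = [(2,)]


def label_clusters(df, sets):
    """Return a copy of df with 'cluster' and 'cluster set' columns.

    `sets` maps a set name to a list of tuples of row labels.
    """
    # One record per molecule: (row label, cluster number, set name).
    records = [
        (molecule, num, name)
        for name, points in sets.items()
        for num, cluster in enumerate(points)
        for molecule in cluster
    ]
    labels = pd.DataFrame(
        records, columns=["molecule", "cluster", "cluster set"]
    ).set_index("molecule")
    # Left join on the index; rows that are in no cluster get NaN.
    return df[["Activity", "SMILES"]].join(labels)


df_triplets = label_clusters(
    df, {"train": train_points, "test": test_points, "val": val_points}
)
print(df_triplets)
```

Rows that belong to no cluster (row 4 in the toy data) end up with NaN in both new columns, which is also why the `cluster` column is shown as floats.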