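As you can see below, my proteinID data frame has 4292 members, but when I try to print them out I get an error on the 13th element (index 12), and I don't understand why.
Any idea what is going on?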
print proteinID.shape
print X_final.shape
for i, prot in enumerate(X_final):
    print i
    print prot
    print proteinID[i]
This gives me:
(4292L,)
(4292L, 4L)
0
[ 0.01070217 0.86624627 0.30031799 1.0022054 ]
Q9BV57
1
[ 0.14132098 0.5899623 -0.08037944 0.04028686]
Q04446
2
[ 0.14768145 0.37698604 -0.08798323 -0.71181829]
P61604
3
[ 0.23194252 -0.17301326 -0.20914528 0.27447231]
Q15029
4
[ 0.13608163 0.41788998 0.06103427 -0.1557695 ]
Q9NRX4
5
[ 0.11981057 0.62419406 0.085566 0.43029529]
P31946
6
[ 0.14734698 0.53942167 0.1647835 0.20525244]
P62258
7
[ 0.13301821 0.25249911 0.32216093 0.46965642]
Q04917
8
[ 0.30891193 0.35936887 0.14029331 0.22116058]
P61981
9
[ 0.15670011 -0.0317209 0.48168144 0.58226224]
P31947;REV__Q13315
10
[ 0.059664 0.52769527 0.09302036 0.28445371]
P27348
11
[ 0.22201161 0.703846 0.19846719 0.53470435]
P63104
12
[ 0.53312759 0.48972197 -0.15224852 0.16086491]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-54-45a793f9a457> in <module>()
4 print i
5 print prot
----> 6 print proteinID[i]
C:\Anaconda\lib\site-packages\pandas\core\series.pyc in __getitem__(self, key)
507 def __getitem__(self, key):
508 try:
--> 509 result = self.index.get_value(self, key)
510
511 if not np.isscalar(result):
C:\Anaconda\lib\site-packages\pandas\core\index.pyc in get_value(self, series, key)
1415
1416 try:
-> 1417 return self._engine.get_value(s, k)
1418 except KeyError as e1:
1419 if len(self) > 0 and self.inferred_type in ['integer','boolean']:
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3109)()
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:2840)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3700)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7229)()
pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7167)()
KeyError: 12L
Edit: the first 50 values of proteinID
for i, n in enumerate(proteinID):
    print i, n
0 Q9BV57
1 Q04446
2 P61604
3 Q15029
4 Q9NRX4
5 P31946
6 P62258
7 Q04917
8 P61981
9 P31947;REV__Q13315
10 P27348
11 P63104
12 O60613
13 Q9C0C2
14 Q9Y2I7
15 Q01970
16 P19174
17 P09543
18 Q6L8Q7
19 P62333
20 P62191
21 P17980
22 P43686
23 P35998
24 P62195
25 Q99460
26 O75832
27 O00231
28 O00232
29 Q9UNM6
30 O00487
31 Q13200
32 O43242
33 P55036
34 Q15008
35 P51665
36 P48556
37 O00233
38 Q13442
39 P82912
40 O15235
41 O60783
42 Q9Y3D3
43 Q9Y2R5
44 Q9NVS2
45 Q9Y676
46 Q9Y399
47 P82650
48 Q9Y3D9
49 P82663
50 Q9BYN8
Answer 0: (score: 0)
I noticed that after removing the NaN values with the following approach
#instead of imputing, we remove rows with nan values
valid_mask = [np.all(~np.isnan(x)) for x in data.values]
print data[valid_mask].shape
X_imputed = data[valid_mask].values
proteinID = proteinID[valid_mask]
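the original index labels are preserved, so in this case the missing indices are the ones that used to belong to rows with NaN values. proteinID[i] looks up the label i rather than the i-th position, so it fails at the first label that was dropped (12 here).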
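
To illustrate, here is a minimal sketch of the same effect on a tiny made-up example, with three possible ways around it. The `data` and `proteinID` values below are hypothetical stand-ins for the real ones, and the names `filtered`, `by_position` and `renumbered` are mine; the mask is built with `notnull().all(axis=1)`, which I assume is equivalent to the list comprehension above for purely numeric data. It is written with `print()` calls for current Python.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's `data` and `proteinID`.
data = pd.DataFrame({"a": [0.1, np.nan, 0.3, 0.4],
                     "b": [1.0, 2.0, np.nan, 4.0]})
proteinID = pd.Series(["Q9BV57", "Q04446", "P61604", "Q15029"])

# True for rows without any NaN (rows 0 and 3 here).
valid_mask = data.notnull().all(axis=1)

X_final = data[valid_mask].values   # plain ndarray: positions 0..n-1
filtered = proteinID[valid_mask]    # Series: keeps the old labels 0 and 3

print(list(filtered.index))         # labels 1 and 2 are gone

# Label-based lookup fails for a dropped label, as in the question.
try:
    filtered[1]
except KeyError as err:
    print("KeyError:", err)

# Option 1: index by position instead of by label.
by_position = [filtered.iloc[i] for i in range(len(X_final))]

# Option 2: renumber the labels so they run 0..n-1 again.
renumbered = filtered.reset_index(drop=True)

# Option 3: avoid indexing altogether and iterate in parallel.
for prot_id, row in zip(filtered.values, X_final):
    print(prot_id, row)
```

Any of the three should make the original loop work; `reset_index(drop=True)` is probably the smallest change, since `proteinID[i]` can then stay as it is.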