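I am quite new to coding, and I have a school project in which I have to work with a dataset in Python using Pandas and Sklearn. The problem is that I have a pandas DataFrame that I need to split using leave-one-out cross-validation (because there are only 140 people in the DataFrame).
Edit: As @FChm pointed out, I used sklearn's LeaveOneOut documentation. Link here: Documentation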
import pandas as pd
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('model_2.csv')
X = data.iloc[:,0:11]
y = data.loc[:,'Diagnosis']
loo = LeaveOneOut()
print(X)
print(y)
print(type(X))
for train_index, test_index in loo.split(X): # Split in X
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
model = LogisticRegression(solver='lbfgs', multi_class='auto' )
model.fit(X_train, y_train)
z = model.score(X_test, y_test)
print (z)
The problem is that slicing the columns gives me a DataFrame, and this is the output and error message I get:
Fusobacterium nucleatum [1480] ... Bilophila wadsworthia [756]
0 0.000000 ... 0.001307
1 0.000617 ... 0.000779
2 0.000000 ... 0.000474
3 0.000000 ... 0.000660
4 0.000025 ... 0.001572
5 0.000000 ... 0.000881
6 0.000000 ... 0.000175
7 0.000000 ... 0.000141
8 0.000181 ... 0.000778
9 0.000000 ... 0.011267
10 0.000962 ... 0.002417
11 0.000011 ... 0.000618
12 0.000000 ... 0.001590
13 0.000001 ... 0.004002
14 0.000000 ... 0.000650
15 0.000029 ... 0.007482
16 0.000000 ... 0.001184
17 0.000000 ... 0.001821
18 0.000045 ... 0.000768
19 0.000000 ... 0.000003
20 0.000182 ... 0.001198
21 0.000000 ... 0.004408
22 0.000000 ... 0.003469
23 0.000000 ... 0.002255
24 0.000292 ... 0.000174
25 0.000000 ... 0.002559
26 0.000000 ... 0.000901
27 0.000015 ... 0.000458
28 0.000045 ... 0.000009
29 0.000437 ... 0.000834
.. ... ... ...
111 0.000000 ... 0.000000
112 0.000000 ... 0.000234
113 0.000000 ... 0.000190
114 0.000000 ... 0.000048
115 0.000000 ... 0.000792
116 0.000000 ... 0.001992
117 0.000010 ... 0.000000
118 0.000108 ... 0.001133
119 0.000000 ... 0.001465
120 0.000000 ... 0.005596
121 0.000000 ... 0.000284
122 0.000000 ... 0.000037
123 0.000000 ... 0.000008
124 0.000000 ... 0.001098
125 0.000000 ... 0.000179
126 0.000000 ... 0.000309
127 0.000030 ... 0.001022
128 0.000000 ... 0.000060
129 0.000002 ... 0.000795
130 0.000000 ... 0.002253
131 0.000000 ... 0.000048
132 0.000000 ... 0.001198
133 0.000000 ... 0.000755
134 0.000011 ... 0.001414
135 0.000000 ... 0.000739
136 0.000000 ... 0.000000
137 0.000000 ... 0.000275
138 0.000000 ... 0.000330
139 0.000000 ... 0.055944
140 0.000000 ... 0.000531
[141 rows x 11 columns]
0 Cancer
1 Cancer
2 Cancer
3 Cancer
4 Cancer
5 Cancer
6 Cancer
7 Cancer
8 Cancer
9 Cancer
10 Cancer
11 Cancer
12 Cancer
13 Cancer
14 Cancer
15 Cancer
16 Cancer
17 Cancer
18 Cancer
19 Cancer
20 Cancer
21 Cancer
22 Cancer
23 Cancer
24 Cancer
25 Cancer
26 Cancer
27 Cancer
28 Cancer
29 Cancer
...
111 Normal
112 Normal
113 Normal
114 Small Adenoma
115 Small Adenoma
116 Small Adenoma
117 Small Adenoma
118 Small Adenoma
119 Small Adenoma
120 Small Adenoma
121 Small Adenoma
122 Small Adenoma
123 Small Adenoma
124 Small Adenoma
125 Small Adenoma
126 Small Adenoma
127 Small Adenoma
128 Small Adenoma
129 Small Adenoma
130 Small Adenoma
131 Small Adenoma
132 Small Adenoma
133 Small Adenoma
134 Small Adenoma
135 Small Adenoma
136 Small Adenoma
137 Small Adenoma
138 Small Adenoma
139 Small Adenoma
140 Small Adenoma
Name: Diagnosis, Length: 141, dtype: object
<class 'pandas.core.frame.DataFrame'>
TRAIN: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
127 128 129 130 131 132 133 134 135 136 137 138 139 140] TEST: [0]
Traceback (most recent call last):
File ".\model.py", line 19, in <module>
X_train, X_test = X[train_index], X[test_index]
File "C:\Users\Chris\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "C:\Users\Chris\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "C:\Users\Chris\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexing.py", line 1327, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: '[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18\n 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36\n 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54\n 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72\n 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90\n 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108\n 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126\n 127 128 129 130 131 132 133 134 135 136 137 138 139 140] not in index'
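Is it possible to use leave-one-out on a pandas DataFrame, or should I use train_test_split instead? And if so, how can I make it behave like leave-one-out?
Answer 0: (score: 1)
I had the same problem and used this code: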
It worked fine for me.
Answer 1: (score: 0)
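I believe the error occurs because you are indexing the DataFrames incorrectly, i.e. you are treating them as arrays. With a DataFrame, X[train_index] looks up column labels rather than row positions, which is why pandas raises a KeyError saying the indices are "not in index". This probably comes from adapting the documentation example, which uses NumPy arrays, without adjusting it for pandas. Note: if you did copy the code directly from the documentation for LeaveOneOut, you should at least cite it: "I adapted the code from here ..."
In any case, there are two ways to fix your problem: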
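To make the difference concrete, here is a minimal sketch of the two indexing styles. The toy DataFrame and its column names are made up for illustration and are not taken from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; column names are invented.
df = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [1.0, 2.0, 3.0]})
rows = np.array([1, 2])  # positions like those returned by LeaveOneOut.split

# df[rows] searches the COLUMN labels for 1 and 2, which do not exist here.
try:
    df[rows]
    raised = False
except KeyError:
    raised = True

# .iloc selects ROWS by integer position, which is what the split indices mean.
subset = df.iloc[rows]

print("KeyError raised:", raised)
print(subset)
```

The same array of integers is therefore valid for a NumPy array or for .iloc, but not for plain square-bracket indexing on a DataFrame.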
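a) You can convert X and y to NumPy arrays with the to_numpy() method of pd.DataFrame (added in pandas 0.24; on older versions the .values attribute does the same job):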
X = data.iloc[:, 0:11].to_numpy()
y = data.loc[:, 'Diagnosis'].to_numpy()
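With arrays, the loop from the question works unchanged. The following is only a sketch under assumptions: model_2.csv is not available here, so the data is a synthetic stand-in (20 samples, 3 features, 2 classes), and the model is refitted inside the loop so that every held-out sample contributes one score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-in for the arrays above (assumption: shapes and labels are invented).
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array(["Cancer", "Normal"] * 10)

loo = LeaveOneOut()
scores = []
for train_index, test_index in loo.split(X):
    # Plain NumPy indexing: integer arrays select rows.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = LogisticRegression(solver="lbfgs")
    model.fit(X_train, y_train)
    # With a single test sample, score() is either 0.0 or 1.0.
    scores.append(model.score(X_test, y_test))

print("folds:", len(scores), "mean accuracy:", np.mean(scores))
```

Because each fold tests on one sample, the individual scores are not informative on their own; the mean over all folds is the leave-one-out accuracy estimate.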
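b) Alternatively, keep X and y as pandas objects (leaving the first few lines as they are) and change the indexing inside the loop to integer-based indexing with .iloc: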
X_train = X.iloc[train_index, :]
# … and so on
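A fuller version of option b) might look like the sketch below. It rests on assumptions: the DataFrame is synthetic (the feature names f1 to f3 and the labels are invented), and the closing cross_val_score call is an alternative I am adding, not something the original answer mentions. It runs the same leave-one-out procedure in one call and accepts the DataFrame directly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for the CSV (assumption: made-up feature names and labels).
rng = np.random.RandomState(0)
data = pd.DataFrame(rng.rand(20, 3), columns=["f1", "f2", "f3"])
data["Diagnosis"] = ["Cancer", "Normal"] * 10

X = data.iloc[:, 0:3]
y = data.loc[:, "Diagnosis"]

loo = LeaveOneOut()
fold_scores = []
for train_index, test_index in loo.split(X):
    # .iloc selects rows by position, so the split indices can be used as-is.
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = LogisticRegression(solver="lbfgs")
    model.fit(X_train, y_train)
    fold_scores.append(model.score(X_test, y_test))

# The same procedure in one call; sklearn handles the DataFrame indexing itself.
cv_scores = cross_val_score(LogisticRegression(solver="lbfgs"), X, y, cv=loo)

print("manual loop:", np.mean(fold_scores))
print("cross_val_score:", cv_scores.mean())
```

Staying with .iloc keeps the column names attached to the training data, which can be convenient when inspecting the folds.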