Question

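So I have a csv with 20000 records. The first column is the label column, and each row contains one letter. The other columns are attributes such as width and height. I import it and copy each record into an array: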

import csv

with open('Letter.csv') as f:
    reader = csv.reader(f)
    annotated_data = [r for r in reader]

Now I want to split the data 80-10-10 without using train_test_split, so I do this:

train_divide = int(0.8 * len(annotated_data))
X_train, X_test = annotated_data[:train_divide], annotated_data[train_divide:]

and likewise for the remaining 10-10 split. Now I want to copy the label column into its own array so that I can pass it to MLPClassifier via mlp.fit(X_train, y_train).

I tried:

for row in X_train:
    y_train = row[0]

I get a len of 1 and an np.shape of (), so I know this is already wrong.

So I tried:

y_train = [row[0] for row in X_train]

When I print len I get 16000, which is what I want, and np.shape gives (16000,), which is also what I want. But now, if I try mlp.fit(X_train, y_train), I get an error saying Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'. Is this because y_train stores the letters as strings? What data type should y_train hold, and how do I fix this?

Is the error caused by the way I copied the label column into y_train? Any help is appreciated.

Edit: the first few rows look like this:

A | 1 | 3 | 4 | 4 | ...

T | 3 | 5 | 3 | 9 | ...

etc.

Answer 1

I can suggest the approach I use, which relies on pandas and sklearn's train_test_split:

import pandas as pd
df = pd.read_csv('Letter.csv')
labels = df[df.columns[0]] # Column 0 because you say it is the first one, but check this index. 
# Better if you name the columns and call them by name
features = df[df.columns[1:]] # Again, check the content of features

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
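The call above yields an 80/20 split. For the 80-10-10 split asked about in the question, one possible approach is to call train_test_split twice: hold out 20% first, then cut that held-out part in half. A minimal sketch, assuming synthetic NumPy arrays in place of Letter.csv (the array contents, the names X_rest, y_rest, X_val, y_val, and the random_state values are made up for illustration); the same calls should work on the features and labels objects above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Letter.csv: 100 rows, 4 numeric features, one letter label each.
rng = np.random.default_rng(0)
features = rng.integers(0, 16, size=(100, 4))
labels = rng.choice(list("ABCD"), size=100)

# Step 1: keep 80% for training, hold out the remaining 20%.
X_train, X_rest, y_train, y_rest = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Step 2: split the held-out 20% in half: 10% validation, 10% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

train_test_split shuffles by default, which is usually desirable if the rows in the file are ordered by letter.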

If by

"each row contains one letter"

you mean that every row contains a string, you may need to vectorize (encode) the strings before feeding them to an ML model.
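If numeric labels do turn out to be needed, each letter can be mapped to an integer code. A minimal sketch in plain Python, using made-up sample labels; the names letter_to_code and code_to_letter are hypothetical. scikit-learn's LabelEncoder provides the same kind of mapping. Note that scikit-learn classifiers generally accept string class labels directly, so this step may well be unnecessary for MLPClassifier.

```python
# Hypothetical sample of the label column (one letter per row).
y_train = ["T", "A", "D", "A", "T"]

# Build a stable letter -> integer mapping from the sorted unique labels.
classes = sorted(set(y_train))
letter_to_code = {letter: code for code, letter in enumerate(classes)}

# Encode the labels, and keep the inverse mapping to decode predictions later.
y_encoded = [letter_to_code[letter] for letter in y_train]
code_to_letter = {code: letter for letter, code in letter_to_code.items()}

print(classes)
print(y_encoded)
```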

Can you post the first few rows of your csv file?
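Staying with the csv-based approach from the question, one likely cause of the dtype error is that csv.reader returns every field as a string and that X_train still contains the letter column, so the feature array ends up with a string dtype. A minimal sketch, assuming comma-separated rows with no header and the label in column 0 (the inline sample data stands in for Letter.csv, and the names train_end, val_end, X_val, y_val are made up for illustration):

```python
import csv
import io

# Inline stand-in for Letter.csv; replace io.StringIO(sample) with open('Letter.csv').
sample = "\n".join(f"{'ABCDE'[i % 5]},{i},{i + 1},{i + 2}" for i in range(20))
rows = list(csv.reader(io.StringIO(sample)))

# Separate the labels from the features and convert the features from str to float.
y = [row[0] for row in rows]
X = [[float(value) for value in row[1:]] for row in rows]

# Manual 80-10-10 split by index (shuffle the rows first if the file is ordered).
train_end = int(0.8 * len(rows))
val_end = int(0.9 * len(rows))
X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]

print(len(X_train), len(X_val), len(X_test))
```

With the features stored as floats and the labels kept in a separate list, mlp.fit(X_train, y_train) should receive purely numeric input for X.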
