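I want to train a random forest on a set of matrices (the first link below is an example). I want to classify them as "g" or "b" (good or bad, a or b, 1 or 0, it doesn't matter).
I've called the script randfore.py. I'm currently using 10 examples, but once I actually have it up and running I'll use a much larger dataset.
Here is the code: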
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import os
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
working_dir = os.getcwd() # Grabs the working directory
directory = working_dir+"/fakesourcestuff/" ## The actual directory where the files are located
sources = list() # Just sets up a list here which is going to become the input for the random forest
for i in range(10):
    cutoutfile = pd.read_csv(directory+ "image2_with_fake_geotran_subtracted_corrected_cutout_" + str(i) +".dat", dtype=object) ## Where we get the input data for the random forest from
    sources.append(cutoutfile) # add it to our sources list
targets = pd.read_csv(directory + "faketargets.dat",sep='\n',header=None, dtype=object) # Reads in our target data... either "g" or "b" (Good or bad)
sources = pd.DataFrame(sources) ## I convert the list to a dataframe to avoid the "ValueError: cannot copy sequence with size 99 to array axis with dimension 1" error. Necessary?
# Training sets
X_train = sources[:8] # Inputs
y_train = targets[:8] # Targets
# Random Forest
rf = RandomForestClassifier(n_estimators=10)
rf_fit = rf.fit(X_train, y_train)
Here is the current error output:
Traceback (most recent call last):
  File "randfore.py", line 31, in <module>
    rf_fit = rf.fit(X_train, y_train)
  File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 247, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "/home/ithil/anaconda2/envs/iraf27/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
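I tried setting dtype=object, but it didn't help. I'm just not sure what kind of manipulation I need to do to make this work.
I think the problem is that the files I'm appending to sources aren't just numbers, but a mix of numbers, commas, and various square brackets (each one is basically a big matrix). Is there a natural way to import this? The square brackets in particular may be a problem.
Before I converted sources to a DataFrame, I was getting the following error:
ValueError: cannot copy sequence with size 99 to array axis with dimension 1. This is due to the dimensions of my input (100 lines long), while my targets have 10 rows and 1 column.
Here are the contents of the first file that gets read into cutoutfile (they are all in exactly the same style), used as input: https://pastebin.com/tkysqmVu
Here are the contents of faketargets.dat, the targets: https://pastebin.com/632RBqWc
Any ideas? Thanks very much. I'm sure there is a lot of fundamental confusion here.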
Answer 0 (score: 0)
Try writing:
X_train = sources.values[:8] # Inputs
y_train = targets.values[:8] # Targets
I hope this solves your problem!
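If `.values` alone is not enough, it may help to look at the shape of the input. "setting an array element with a sequence" usually means the input cannot be turned into a plain 2-D numeric array, and a random forest expects one row of numbers per sample. Below is a minimal sketch of one way to get there: flatten each matrix into a single feature row. The helper `matrix_text_to_features` and the sample strings are hypothetical, and the bracketed layout is only an assumption based on the description in the question, not the real contents of the pastebin files.

```python
import re
import numpy as np

# Matches integers, decimals, and scientific notation, with an optional sign.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def matrix_text_to_features(text):
    """Hypothetical helper: pull every number out of bracketed matrix text
    and return the values as one flat 1-D float vector."""
    return np.array([float(tok) for tok in NUMBER.findall(text)], dtype=float)


# Made-up stand-ins for the text of three cutout files (assumed layout).
fake_files = [
    "[[1.0, 2.0], [3.0, 4.0]]",
    "[[0.5, 0.1], [0.2, 0.9]]",
    "[[7.0, 8.0], [9.0, 1.0]]",
]
fake_labels = ["g", "b", "g"]

# One row per sample, one column per matrix element.
X = np.vstack([matrix_text_to_features(t) for t in fake_files])
# A 1-D label array with one entry per sample.
y = np.array(fake_labels)

print(X.shape)  # (3, 4)
print(y.shape)  # (3,)
```

With real files, the text would come from reading each `.dat` file as a string instead of the made-up list above. Arrays shaped like this, `(n_samples, n_features)` for the inputs and `(n_samples,)` for the targets, are what `rf.fit(X_train, y_train)` expects. For the targets read with pandas, something like `targets.values.ravel()` should give the 1-D form.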