Splitting data into training and test sets

Posted: 2019-06-11 03:09:56

Tags: python machine-learning training-data

I want to replicate this tutorial (https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/) to classify two groups using a different dataset, but despite my efforts I have not been able to do it. I am new to programming and would appreciate any help or hints.

My dataset is small (240 files per group), with files named 01-0240.

I think the problem is around these lines of code:

    if is_trian and filename.startswith('cv9'):
        continue
    if not is_trian and not filename.startswith('cv9'):
        continue

And these:

    trainy = [0 for _ in range(900)] + [1 for _ in range(900)]
    save_dataset([trainX,trainy], 'train.pkl')

    testY = [0 for _ in range(100)] + [1 for _ in range(100)]
    save_dataset([testX,testY], 'test.pkl')

So far I have run into two errors:

    Input arrays should have the same number of samples as target arrays. Found 483 input samples and 200 target samples.

    Unable to open file (unable to open file: name = 'model.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

I would greatly appreciate your prompt help.

Thank you.

// Part of the code, for clarity: //

# load all docs in a directory
def process_docs(directory, is_trian):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any transcript in the test set

As mentioned in the tutorial, I want to add an argument below to indicate whether training or test files are being processed. Or if there is another way, please share it.

        if is_trian and filename.startswith('----'):
            continue
        if not is_trian and not filename.startswith('----'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc)
        # add to list
        documents.append(tokens)
    return documents

# save a dataset to file
def save_dataset(dataset, filename):
    dump(dataset, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load all training transcripts
healthy_docs = process_docs('PathToData/healthy', True)
sick_docs = process_docs('PathToData/sick', True)
trainX = healthy_docs + sick_docs
trainy = [0 for _ in range(len( healthy_docs ))] + [1 for _ in range(len( sick_docs ))]
save_dataset([trainX,trainy], 'train.pkl')

# load all test transcripts
healthy_docs = process_docs('PathToData/healthy', False)
sick_docs = process_docs('PathToData/sick', False)
testX = healthy_docs + sick_docs
testY = [0 for _ in range(len( healthy_docs ))] + [1 for _ in range(len( sick_docs ))]

save_dataset([testX,testY], 'test.pkl')

2 Answers:

Answer 0 (score: 0)

You should post more code, but it sounds like your problem is in organizing the data. Suppose you have 240 files in a "healthy" folder and 240 files in a "sick" folder. You then need to label all the healthy people 0 and all the sick people 1. Try something like this:

from glob import glob 
from sklearn.model_selection import train_test_split

#get the filenames for healthy people 
xhealthy = [ fname for fname in glob( 'pathToData/healthy/*' )]

#give healthy people label of 0
yhealthy = [ 0 for i in range( len( xhealthy ))]

#get the filenames of sick people
xsick    = [ fname for fname in glob( 'pathToData/sick/*')]

#give sick people label of 1
ysick    = [ 1 for i in range( len( xsick ))]

#combine the data 
xdata = xhealthy + xsick 
ydata = yhealthy + ysick 

#create the training and test set 
X_train, X_test, y_train, y_test = train_test_split(xdata, ydata, test_size=0.1)

Then train your model with X_train, y_train and test it with X_test, y_test. Keep in mind that your X data are just filenames that still need to be processed. The more code you post, the more people can help you with the problem.
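To see that `train_test_split` keeps each filename paired with its label, here is a minimal stdlib-only sketch of the same shuffle-and-split idea (the filenames are hypothetical placeholders standing in for the `glob()` results; no sklearn required):

```python
import random

# hypothetical filenames standing in for glob('pathToData/healthy/*') etc.
xhealthy = ['healthy/%03d.txt' % i for i in range(1, 241)]
xsick = ['sick/%03d.txt' % i for i in range(1, 241)]

# combine the data, labeling healthy as 0 and sick as 1
xdata = xhealthy + xsick
ydata = [0] * len(xhealthy) + [1] * len(xsick)

# shuffle filename/label pairs together, then hold out 10% for testing
pairs = list(zip(xdata, ydata))
random.Random(42).shuffle(pairs)
split = int(len(pairs) * 0.9)
X_train, y_train = zip(*pairs[:split])
X_test, y_test = zip(*pairs[split:])

print(len(X_train), len(X_test))  # 432 48
```

Because each filename travels with its label through the shuffle, the input and target counts always match, which is exactly what the "483 input samples and 200 target samples" error is complaining about.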

Answer 1 (score: 0)

I was able to solve the problem by manually splitting the dataset into a training set and a test set, and then labeling each set separately. My current dataset is small, so I will keep looking for a better solution for large datasets once I am able to. Posting this to close the question.
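For reference, the manual split described above can be sketched like this: hold out a fixed number of files per class and derive the labels from the list lengths rather than hard-coding counts such as 900 and 100 (the filename lists here are hypothetical placeholders for the 240 files per group in the question):

```python
# hypothetical per-class filename lists (240 files each, as in the question)
healthy = ['healthy/%04d.txt' % i for i in range(1, 241)]
sick = ['sick/%04d.txt' % i for i in range(1, 241)]

n_test = 24  # hold out 10% of each class for testing

# manual split: the last n_test files of each class become the test set
trainX = healthy[:-n_test] + sick[:-n_test]
testX = healthy[-n_test:] + sick[-n_test:]

# label each set separately, sized from the actual lists
trainy = [0] * len(healthy[:-n_test]) + [1] * len(sick[:-n_test])
testY = [0] * n_test + [1] * n_test

print(len(trainX), len(trainy), len(testX), len(testY))  # 432 432 48 48
```

Sizing the label lists from the data lists guarantees the input and target arrays always have the same number of samples, regardless of how many files are actually in each folder.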