I want to reproduce this tutorial, https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/, to classify two groups using a different dataset, but despite my best efforts I have not been able to. I am new to programming and would appreciate any help or hints.
My dataset is small (240 files per group), and the files are named 01-0240.
I think the problem is around these lines of code:
if is_trian and filename.startswith('cv9'):
    continue
if not is_trian and not filename.startswith('cv9'):
    continue
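Since my filenames are just numbers from 01 to 0240 rather than 'cv9...', I am guessing the check inside the loop would have to parse the number instead. A rough sketch of what I mean (the cutoff of 200 training files per folder is only my assumption, and the parsing assumes names like '0199.txt'):
# keep the last 40 files of each folder for the test set (assumed split)
file_number = int(filename.split('.')[0])
if is_trian and file_number > 200:
    continue
if not is_trian and file_number <= 200:
    continue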
And these lines as well:
trainy = [0 for _ in range(900)] + [1 for _ in range(900)]
save_dataset([trainX,trainy], 'train.pkl')
testY = [0 for _ in range(100)] + [1 for _ in range(100)]
save_dataset([testX,testY], 'test.pkl')
So far I have run into two errors:
Input arrays should have the same number of samples as target arrays. Found 483 input samples and 200 target samples.
Unable to open file (unable to open file: name = 'model.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
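I suspect the second error means my training script never got as far as saving the model, so there is no model.h5 for the later step to load. If I follow the tutorial correctly, something along these lines has to run first (a sketch, assuming the fitted Keras model is in a variable called model):
# after model.fit(...), save the trained model so later steps can load it
model.save('model.h5')
# then, in the evaluation/prediction step
from keras.models import load_model
model = load_model('model.h5')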
I would really appreciate your prompt help.
Thank you.
// Part of the code, to make things clearer. //
# load all docs in a directory
def process_docs(directory, is_trian):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any transcript in the test set
        # As mentioned in the tutorial, I want to add an argument here to indicate
        # whether training or test files are being processed. Or if there is
        # another way, please share it.
        if is_trian and filename.startswith('----'):
            continue
        if not is_trian and not filename.startswith('----'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc)
        # add to list
        documents.append(tokens)
    return documents
# save a dataset to file
def save_dataset(dataset, filename):
    dump(dataset, open(filename, 'wb'))
    print('Saved: %s' % filename)
# load all training transcripts
healthy_docs = process_docs('PathToData/healthy', True)
sick_docs = process_docs('PathToData/sick', True)
trainX = healthy_docs + sick_docs
trainy = [0 for _ in range(len( healthy_docs ))] + [1 for _ in range(len( sick_docs ))]
save_dataset([trainX,trainy], 'train.pkl')
# load all test transcripts
healthy_docs = process_docs('PathToData/healthy', False)
sick_docs = process_docs('PathToData/sick', False)
testX = healthy_docs + sick_docs
testY = [0 for _ in range(len( healthy_docs ))] + [1 for _ in range(len( sick_docs ))]
save_dataset([testX,testY], 'test.pkl')
Answer 0 (score: 0)
You should post more code, but it sounds like your problem is organising the data. Suppose you have 240 files in a "healthy" folder and 240 files in a "sick" folder. Then you need to give all the healthy people a label of 0 and all the sick people a label of 1. Try something like this:
from glob import glob
from sklearn.model_selection import train_test_split
#get the filenames for healthy people
xhealthy = [ fname for fname in glob( 'pathToData/healthy/*' )]
#give healthy people label of 0
yhealthy = [ 0 for i in range( len( xhealthy ))]
#get the filenames of sick people
xsick = [ fname for fname in glob( 'pathToData/sick/*')]
#give sick people label of 1
ysick = [ 1 for i in range( len( xsick ))]
#combine the data
xdata = xhealthy + xsick
ydata = yhealthy + ysick
#create the training and test set
X_train, X_test, y_train, y_test = train_test_split(xdata, ydata, test_size=0.1)
Then train your model with X_train, y_train and test it with X_test, y_test - keep in mind that your X data is just the filenames, which still need to be processed. The more code you post, the more people can help you with the problem.
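For example, once you have the split, you would still load and clean each file before building a vocabulary and training, roughly like this (a sketch, assuming the load_doc and clean_doc functions from your code above):
# turn the filename lists into cleaned token lists
train_docs = [clean_doc(load_doc(fname)) for fname in X_train]
test_docs = [clean_doc(load_doc(fname)) for fname in X_test]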
Answer 1 (score: 0)
I was able to solve the problem by manually splitting the dataset into a training set and a test set and then labelling each set separately. My current dataset is small, so I will keep looking for a better solution for larger datasets once I am able to. Posting this to close the question.
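In case it helps anyone else, the manual split I ended up with looks roughly like this (the 200/40 split per class is just what I chose for my 240 files, not anything from the tutorial, and it assumes process_docs was called without the filename filter so healthy_docs and sick_docs each hold all 240 cleaned documents):
# split each class by hand: first 200 files for training, last 40 for testing
trainX = healthy_docs[:200] + sick_docs[:200]
trainy = [0] * 200 + [1] * 200
testX = healthy_docs[200:] + sick_docs[200:]
testY = [0] * 40 + [1] * 40
save_dataset([trainX, trainy], 'train.pkl')
save_dataset([testX, testY], 'test.pkl')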