I am training a CNN on Kaggle. My data comes in two parts: a CSV label file and a folder of images. How do I split the data on Kaggle into a train/test split? Thanks.
Answer 0 (score: 1)
The function below creates train, test, and validation generators.

source_dir - the full path to the directory containing all the images
cvs_path - the path to the CSV file, which has a column (x_col) of filename strings and a column (y_col) of the class string associated with each filename
Note: source_dir/filename resolves to the path of a file in source_dir
The function automatically determines each generator's batch_size and provides the steps to use in model.fit, so that each epoch makes only a single pass through the training, test, or validation images.
max_batch_size - the largest batch size you will allow, based on your memory constraints
train_split - a float between 0 and 1 specifying the fraction of the images used for training
test_split - a float between 0 and 1 specifying the fraction of the images used for testing; note that validation_split is computed internally as 1 - train_split - test_split
target_size - a tuple (height, width) the input images are resized to
scale - a float; pixels are rescaled to pixel * scale (typically 1/255)
class_mode - see the keras flow_from_dataframe documentation for details; typically use 'categorical'
import os
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
def train_test_valid_split(source_dir, cvs_path, max_batch_size, train_split, test_split,
                           x_col, y_col, class_mode, target_size, scale):
    data = pd.read_csv(cvs_path).copy()
    # fraction of the non-training remainder that becomes the test set
    te_split = test_split / (1 - train_split)
    # randomly sample the training rows
    train_df = data.sample(n=None, frac=train_split, replace=False, weights=None,
                           random_state=123, axis=0)
    tr_batch_size = max_batch_size
    tr_steps = int(len(train_df.index) // tr_batch_size)
    # the remaining rows are split between test and validation
    dummy_df = data.drop(train_df.index, axis=0, inplace=False)
    test_df = dummy_df.sample(n=None, frac=te_split, replace=False, weights=None,
                              random_state=123, axis=0)
    te_batch_size, te_steps = get_bs(len(test_df.index), max_batch_size)
    valid_df = dummy_df.drop(test_df.index, axis=0)
    v_batch_size, v_steps = get_bs(len(valid_df.index), max_batch_size)
    gen = ImageDataGenerator(rescale=scale)
    train_gen = gen.flow_from_dataframe(dataframe=train_df, directory=source_dir,
                                        batch_size=tr_batch_size, x_col=x_col, y_col=y_col,
                                        target_size=target_size, class_mode=class_mode,
                                        seed=123, validate_filenames=False)
    test_gen = gen.flow_from_dataframe(dataframe=test_df, directory=source_dir,
                                       batch_size=te_batch_size, x_col=x_col, y_col=y_col,
                                       target_size=target_size, class_mode=class_mode,
                                       shuffle=False, validate_filenames=False)
    valid_gen = gen.flow_from_dataframe(dataframe=valid_df, directory=source_dir,
                                        batch_size=v_batch_size, x_col=x_col, y_col=y_col,
                                        target_size=target_size, class_mode=class_mode,
                                        shuffle=False, validate_filenames=False)
    return train_gen, tr_steps, test_gen, te_steps, valid_gen, v_steps

def get_bs(length, b_max):
    # largest divisor of length that is <= b_max, so steps * batch_size == length
    # and every image is used exactly once per epoch
    batch_size = sorted([int(length / n) for n in range(1, length + 1)
                         if length % n == 0 and length / n <= b_max], reverse=True)[0]
    steps = int(length // batch_size)
    return batch_size, steps
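get_bs picks the largest batch size no bigger than b_max that divides the set length exactly, so steps * batch_size covers every image exactly once. As a quick check against the test-set numbers from the run shown further below (3772 images, max batch size 32):

# sanity check of get_bs: 23 is the largest divisor of 3772 that is <= 32
bs, steps = get_bs(3772, 32)
print(bs, steps, bs * steps)  # prints: 23 164 3772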
The CSV file is of the form
file_id class_id
0 00000.jpg AFRICAN CROWNED CRANE
1 00001.jpg AFRICAN CROWNED CRANE
2 00002.jpg AFRICAN CROWNED CRANE
3 00003.jpg AFRICAN CROWNED CRANE
4 00004.jpg AFRICAN CROWNED CRANE
5 00005.jpg AFRICAN CROWNED CRANE
6 00006.jpg AFRICAN CROWNED CRANE
7 00007.jpg AFRICAN CROWNED CRANE
8 00008.jpg AFRICAN CROWNED CRANE
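Since the generators are created with validate_filenames=False (which speeds up generator creation), it can be worth verifying once, before training, that every file_id in the CSV resolves to a real file under source_dir. A minimal sketch, using the same paths as the usage example below:

import os
import pandas as pd

# one-off check that each CSV filename exists in the image directory
df = pd.read_csv(r'c:\temp\birds\birds.csv')
img_dir = r'c:\temp\birds\consolidated_images'
missing = [f for f in df['file_id'] if not os.path.isfile(os.path.join(img_dir, f))]
print(len(missing), 'missing files')  # expect 0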
Below is an example of its use
source_dir=r'c:\temp\birds\consolidated_images'
cvs_path=r'c:\temp\birds\birds.csv'
train_split=.8
test_split=.1
x_col='file_id'
y_col='class_id'
target_size=(224,224)
scale=1/255  # rescale multiplies pixels by this factor
max_batch_size=32
class_mode='categorical'
train_gen, train_steps, test_gen, test_steps, valid_gen, valid_steps = train_test_valid_split(
    source_dir, cvs_path, max_batch_size, train_split, test_split, x_col, y_col,
    class_mode, target_size, scale)
print ('train steps: ', train_steps, ' test steps: ', test_steps, ' valid steps: ', valid_steps)
The results of execution are
Found 30172 non-validated image filenames belonging to 250 classes.
Found 3772 non-validated image filenames belonging to 250 classes.
Found 3771 non-validated image filenames belonging to 250 classes.
train steps: 942 test steps: 164 valid steps: 419
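The model.fit call below assumes you already have a compiled Keras model whose input matches target_size and whose output matches the number of classes. The answer does not specify an architecture, so as a purely illustrative placeholder for the 224 x 224 RGB images and 250 classes above:

from tensorflow.keras import layers, models

# hypothetical minimal CNN - an assumption, not part of the original answer
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(250, activation='softmax')  # 250 classes, matching the generators
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])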
Now use these generators
epochs = 20  # set to what you want
history = model.fit(x=train_gen, epochs=epochs, steps_per_epoch=train_steps,
                    validation_data=valid_gen, validation_steps=valid_steps,
                    shuffle=False, verbose=1)
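The returned history object records per-epoch metrics, which you can inspect after training; a short sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

# plot training vs. validation loss from the history returned by model.fit
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='valid loss')
plt.xlabel('epoch')
plt.legend()
plt.show()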
After training
accuracy=model.evaluate(test_gen, steps=test_steps)[1]*100
print ('Model accuracy on test set is', accuracy)
or make predictions
predictions=model.predict(test_gen, steps=test_steps, verbose=1)
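Because test_gen was created with shuffle=False, the rows of predictions line up with test_gen.filenames, and the class_indices dict that flow_from_dataframe builds can be inverted to map predicted indices back to class names:

import numpy as np

# invert the class_name -> index mapping the generator built from y_col
index_to_class = {v: k for k, v in test_gen.class_indices.items()}
predicted_labels = [index_to_class[i] for i in np.argmax(predictions, axis=1)]
for fname, label in zip(test_gen.filenames[:5], predicted_labels[:5]):
    print(fname, '->', label)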