Question: Why does train_test_split take so long to run?

I am using Google Colab and trying to train a convolutional neural network. To split a dataset of about 11,500 images, each of shape 63x63x63, I used train_test_split from sklearn.

test_split = 0.1
random_state = 42
X_train, X_test, y_train, y_test = train_test_split(triplets, df.label, test_size = test_split, random_state = random_state)

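Every time my runtime disconnects, I have to run this again before I can continue. However, this one command takes nearly 10 minutes (or even longer) to run. Every other command in the notebook runs quickly (within a few seconds or less). I am not sure what the problem is. I tried changing the runtime to GPU, and my internet connection seems quite stable. What could the problem be?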

Answer 1

Why does it take so much time?

Your data has shape 11500x63x63x63. The split takes a long time simply because the data is so large.

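Explanation: since the data has shape 11500x63x63x63, it contains about 3x10^9 memory locations (the exact value is 2875540500). As a rough rule of thumb, a machine can execute 10^7 to 10^8 instructions per second. Since Python is relatively slow, I assume Google Colab can execute about 10^7 instructions per second, so: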

Minimum time required for train_test_split = 3x10^9 / 10^7 = 300 seconds = 5 minutes
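The element count above can be checked with a few lines of Python. This is only a sketch: the question does not state the dtype of triplets, so the dtypes listed here are assumptions chosen to show how the memory footprint scales with element size.

```python
import numpy as np

# Shape taken from the question: about 11,500 samples of 63x63x63 each.
shape = (11500, 63, 63, 63)

# Use a 64-bit accumulator so the product cannot overflow on any platform.
n_elements = int(np.prod(shape, dtype=np.int64))
print(f"elements: {n_elements:,}")

# The dtype of `triplets` is unknown; these are illustrative assumptions.
for dtype in (np.int8, np.float32, np.float64):
    size_gib = n_elements * np.dtype(dtype).itemsize / 2**30
    print(f"{np.dtype(dtype).name:>8}: {size_gib:.1f} GiB")
```

Whatever the real dtype is, every byte of the selected rows has to be read and written at least once when the split is materialized.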

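However, although the actual time complexity of train_test_split is close to O(n), the function becomes a bottleneck because of the huge amount of data that has to be passed in and retrieved. This can nearly double the running time of your script.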
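To see why data movement matters, note that selecting rows from a NumPy array with an integer index array always produces a new array rather than a view. The sketch below uses a small stand-in array (the shape and dtype are assumptions, chosen so it runs instantly) to show that the selected rows occupy fresh memory.

```python
import numpy as np

# Small stand-in for the real feature array (hypothetical shape and dtype).
X = np.zeros((200, 8, 8, 8), dtype=np.float32)

# Pick 180 shuffled row indices, similar in spirit to a 90% training split.
rows = np.random.default_rng(0).permutation(len(X))[:180]

# Integer-array indexing copies the selected rows into a new array.
X_subset = X[rows]

shares = np.shares_memory(X, X_subset)
print("shares memory with X:", shares)
print("bytes copied:", X_subset.nbytes)
```

With the full dataset, the same copy touches billions of elements, which is where most of the time is likely to go.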

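How to solve it?

A simple solution is to pass the indices of the feature dataset instead of the feature dataset itself (here, the feature dataset is triplets). This saves the extra time spent copying the training and test features that train_test_split returns internally. Depending on the data type you are currently using, this may improve performance.

To illustrate what I mean, here is a short snippet: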

import numpy as np
from sklearn.model_selection import train_test_split

# Build an index array for the input features
X_index = np.arange(len(triplets))

# Pass the index array instead of the large feature array
idx_train, idx_test, y_train, y_test = train_test_split(X_index, df.label, test_size=0.1, random_state=42)

# Extract the feature arrays using the split index arrays
X_train = triplets[idx_train]
X_test = triplets[idx_test]

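In the code above, I pass the indices of the input features and let train_test_split split those. I then extract the training and test sets manually, which avoids the cost of having the function return the large arrays.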
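As a sanity check, the index-based split should select the same rows as splitting the feature array directly, provided test_size and random_state are identical. The following is a minimal sketch on small random data; the array shapes and the variable names with the _direct suffix are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small random stand-ins for `triplets` and `df.label` (hypothetical data).
rng = np.random.default_rng(0)
X = rng.random((100, 4, 4, 4))
y = rng.integers(0, 2, size=100)

# Direct split of the feature array.
X_tr_direct, X_te_direct, y_tr_direct, y_te_direct = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Index-based split with the same parameters, then manual extraction.
idx = np.arange(len(X))
idx_tr, idx_te, y_tr, y_te = train_test_split(
    idx, y, test_size=0.1, random_state=42
)
X_tr, X_te = X[idx_tr], X[idx_te]

same = (
    np.array_equal(X_tr, X_tr_direct)
    and np.array_equal(X_te, X_te_direct)
    and np.array_equal(y_tr, y_tr_direct)
    and np.array_equal(y_te, y_te_direct)
)
print("identical split:", same)
```

Keeping idx_tr and idx_te around also makes it possible to save the split indices and reuse them after a runtime restart, instead of recomputing the split.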

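The expected speedup depends on the data type you are currently using. To back up my answer, I have added a benchmark that uses NumPy arrays of several data types, tested on Google Colab. The benchmark code and its output follow. Note that in some cases the improvement in practice is smaller than what the benchmark shows.

Code: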

import timeit
import numpy as np
from sklearn.model_selection import train_test_split

def benchmark(dtypes):
    for dtype in dtypes:
        print('Benchmark for dtype', dtype, end='\n'+'-'*40+'\n')
        X = np.ones((5000, 63, 63, 63), dtype=dtype)
        y = np.ones((5000, 1), dtype=dtype)
        X_index = np.arange(0, 5000)

        start_time = timeit.default_timer()
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
        print(f'Time elapsed: {timeit.default_timer()-start_time:.3f}')

        start_time = timeit.default_timer()
        X_train, X_test, y_train, y_test = train_test_split(X_index, y, test_size=0.1, random_state=42)

        X_train = X[X_train]
        X_test = X[X_test]
        print(f'Time elapsed with indexing: {timeit.default_timer()-start_time:.3f}')
        print()

benchmark([np.int8, np.int16, np.int32, np.int64, np.float16, np.float32, np.float64])

Output:

Benchmark for dtype <class 'numpy.int8'>
----------------------------------------
Time elapsed: 0.473
Time elapsed with indexing: 0.304

Benchmark for dtype <class 'numpy.int16'>
----------------------------------------
Time elapsed: 0.895
Time elapsed with indexing: 0.604

Benchmark for dtype <class 'numpy.int32'>
----------------------------------------
Time elapsed: 1.792
Time elapsed with indexing: 1.182

Benchmark for dtype <class 'numpy.int64'>
----------------------------------------
Time elapsed: 2.493
Time elapsed with indexing: 2.398

Benchmark for dtype <class 'numpy.float16'>
----------------------------------------
Time elapsed: 0.730
Time elapsed with indexing: 0.738

Benchmark for dtype <class 'numpy.float32'>
----------------------------------------
Time elapsed: 1.904
Time elapsed with indexing: 1.400
    
Benchmark for dtype <class 'numpy.float64'>
----------------------------------------
Time elapsed: 5.166
Time elapsed with indexing: 3.076
