Why does the time cost keep growing while reading and processing .mat files?

Time: 2019-05-27 09:10:53

Tags: python pandas dataframe

I have 6,500 .mat files of ECG recordings.
I want to read them in and do some processing, but the time cost turned out to be far higher than I expected and than what tqdm estimated at the start.
So I am wondering whether there is a problem in my code.
Here is a sample mat file:

# the values of each array are shown as identical for convenience; in fact they are all different
mat1 = scipy.io.loadmat('Train/TRAIN0001.mat')
mat1
{'I': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'II': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'III': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'V1': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'V2': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'V3': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'V4': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'V5': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'V6': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 '__globals__': [],
 '__header__': b'MATLAB 5.0 MAT-file Platform: nt, Created on: Mon May 6 16:56:48 2019',
 '__version__': '1.0',
 'aVF': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'aVL': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'aVR': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
 'age': array([[63]], dtype=int32),
 'sex': array(['FEMALE'], dtype='<U6'),
}

Here is the code:

def read_mat(mat_path, index):
    mat = scipy.io.loadmat(mat_path)
    mat_df = pd.DataFrame({
                            'I_' + str(index): mat['I'][0],
                            'II_' + str(index): mat['II'][0],
                            'III_' + str(index): mat['III'][0],
                            'V1_' + str(index): mat['V1'][0],
                            'V2_' + str(index): mat['V2'][0],
                            'V3_' + str(index): mat['V3'][0],
                            'V4_' + str(index): mat['V4'][0],
                            'V5_' + str(index): mat['V5'][0],
                            'V6_' + str(index): mat['V6'][0],
                            'aVF_' + str(index): mat['aVF'][0],
                            'aVL_' + str(index): mat['aVL'][0],
                            'aVR_' + str(index): mat['aVR'][0]
    })

    age = pd.DataFrame({'age': mat['age'][0]})
    sex = pd.DataFrame({'sex': mat['sex']})
    sex['sex'] = sex['sex'].apply(lambda x: 1 if x == 'male' else (0 if x == 'female' else 2))

    return mat_df, age, sex

def read_data():

    # target.csv stores the label of each person
    tar = pd.read_csv('target.csv')

    # each ECG recording contains 5000 samples per person, so I want to treat every sample as a feature
    train = pd.DataFrame(columns=[i for i in range(0, 5000)])
    for i in tqdm(range(1, 6501)):
        tmp_filename = 'TRAIN' + str(i).zfill(4)
        train_tmp, age, sex = read_mat('Train/' + tmp_filename, i)
        train_tmp = train_tmp.transpose()
        train_tmp['age'] = age['age'][0]
        train_tmp['sex'] = sex['sex'][0]
        train_tmp['target'] = tar['label'][i-1]

        # add the 5000 samples of each mat file to the train DataFrame
        train = train.append(train_tmp)
        del train_tmp, age, sex

    target = pd.Series()
    target = train['target']

    return train, target, tar

Here is the time cost:

  

  0%| 11/6500 [00:00&lt;01:01, 105.36it/s]
  0%| 19/6500 [00:00&lt;01:08, 94.25it/s]
  ...
 10%| 636/6500 [02:14&lt;39:37, 2.47it/s]
 10%| 640/6500 [02:15&lt;39:52, 2.45it/s]
  ...
 20%| 1322/6500 [09:25&lt;1:12:56, 1.18it/s]
 20%| 1328/6500 [09:30&lt;1:13:27, 1.17it/s]
  ...
 30%| 1918/6500 [20:02&lt;1:13:53, 1.23s/it]
  ...
 40%| 2586/6500 [35:52&lt;1:44:42, 1.61s/it]
  ...
 50%| 3237/6500 [2:08:11&lt;10:58:41, 12.09s/it]

By the time 50% of the mat files had been read, the estimated remaining time was already more than 10 hours.
I would like to know what is wrong with my code that makes it take so long.
Could someone give me some hints about my code?
Thanks in advance.

1 Answer:

Answer 0: (score: 1)

Disclaimer: the proper way to check this would be to run the code through a profiler, which I have not done (since it would require faking input data of a plausible size, etc.).

Looking at the body of the for loop, the only line that could reasonably account for the growing execution time is

train = train.append(train_tmp)

The docs specifically say to avoid this (probably because it creates a Schlemiel the Painter situation):

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
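
As a minimal sketch, here is that pattern applied to the loop in the question (it reuses the read_mat helper and file layout from above; the list name frames is illustrative, and the sketch has not been run against the real data):

import pandas as pd
from tqdm import tqdm

def read_data():
    tar = pd.read_csv('target.csv')

    frames = []                                   # collect per-file DataFrames in a plain list
    for i in tqdm(range(1, 6501)):
        tmp_filename = 'TRAIN' + str(i).zfill(4)
        train_tmp, age, sex = read_mat('Train/' + tmp_filename, i)
        train_tmp = train_tmp.transpose()
        train_tmp['age'] = age['age'][0]
        train_tmp['sex'] = sex['sex'][0]
        train_tmp['target'] = tar['label'][i - 1]
        frames.append(train_tmp)                  # list append: no copying of earlier data

    train = pd.concat(frames)                     # single concatenation at the end
    target = train['target']
    return train, target, tar

With a list, each iteration does a roughly constant amount of work, whereas train.append(train_tmp) copies every row accumulated so far, so the total work grows quadratically with the number of files read. That is consistent with the tqdm rate dropping steadily from over 100 it/s at the start to about one iteration every 12 seconds by the halfway point.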