Question

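I am trying to read an ASCII (text-based) file in a specific format. I have done some line profiling, and most of the time is spent inside the loop. I would like to know whether the code inside the loop can be made faster.

Things I have tried

Initializing the numpy arrays through the buffer interface, as in the official docs, to get faster indexing into the numpy arrays. I hoped this would give a large speedup, but it changed almost nothing.
A custom type-conversion function (no Python interaction) to replace int(line[0:5]), but it turned out to be more expensive.

Custom function for type conversion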

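cdef int fast_atoi(str buf):
    # Build an integer from the digits found in the first five characters of buf.
    cdef int i = 0, c = 0, x = 0
    for i in range(5):
        c = buf[i]
        # 48..57 are the ASCII codes of '0'..'9'; anything else (e.g. padding spaces) is skipped
        if c > 47 and c < 58:
            x = x * 10 + c - 48
    return x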
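
For reference, the digit-accumulation logic above can be sketched in plain Python and checked against the built-in int(). This is only an illustration of what the function computes, not a faster replacement. The name fast_atoi_reference and the width parameter are hypothetical, and the sketch assumes the field arrives as bytes, so that iterating over it yields integer character codes.

```python
def fast_atoi_reference(buf: bytes, width: int = 5) -> int:
    """Accumulate the ASCII digits found in the first `width` bytes of buf."""
    x = 0
    for c in buf[:width]:        # iterating over bytes yields integers 0..255
        if 48 <= c <= 57:        # ASCII codes of '0'..'9'
            x = x * 10 + (c - 48)
    return x


# Compare with the built-in conversion on right-aligned, space-padded fields.
for field in (b"    1", b"   42", b"12345"):
    assert fast_atoi_reference(field) == int(field)
```

Like the Cython version, this skips every non-digit character, so a minus sign would be silently dropped.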

The main code block I want to optimize

import numpy as np
cimport numpy as np

def func(filename):
    cdef np.ndarray[np.int32_t] a1
    cdef np.ndarray[object] a2
    cdef np.ndarray[object] a3
    cdef np.ndarray[np.int32_t] a4
    cdef int count = 0
    cdef int n_lines
    cdef str line
    with open(filename) as inf:
        next(inf)                    # skip the title line
        n_lines = int(next(inf))     # second line holds the number of records
        # preallocate one entry per record
        a1 = np.zeros(n_lines, dtype=np.int32)
        a2 = np.zeros(n_lines, dtype=object)
        a3 = np.zeros(n_lines, dtype=object)
        a4 = np.zeros(n_lines, dtype=np.int32)
        for i, line in enumerate(inf):
            if i == n_lines:
                break
            try:
                # fixed-width fields, five characters each
                a1[i] = int(line[0:5])  # alternative tried: fast_atoi(line[0:5])
                a2[i] = line[5:10].strip()
                a3[i] = line[10:15].strip()
                a4[i] = int(line[15:20])
            except (ValueError, TypeError) as e:
                break                # stop at the first malformed line
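
As a point of comparison for the four slices in the loop, here is a minimal plain-Python sketch that splits the same 20 leading characters in one call using the standard struct module. This is a different technique from the one in the code above, and whether it is any faster would have to be measured. The names RECORD and parse_record are hypothetical, and the sketch assumes the line is available as bytes (for example from a file opened in "rb" mode).

```python
import struct

# Four consecutive 5-byte fields, matching the slices [0:5], [5:10], [10:15], [15:20].
RECORD = struct.Struct("5s5s5s5s")


def parse_record(line: bytes):
    """Split one fixed-width line into (int, str, str, int)."""
    f1, f2, f3, f4 = RECORD.unpack_from(line)  # needs at least 20 bytes
    return int(f1), f2.strip().decode("ascii"), f3.strip().decode("ascii"), int(f4)


print(parse_record(b"    1xyz      A    1   5.202   4.356   3.155\n"))
```

A line shorter than 20 bytes raises struct.error here, so the except clause would need to catch that as well.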

I have a 4.3 MB file:

Author
n_lines
    1xyz      A    1   5.202   4.356   3.155
    1mno     A1    2   5.119   4.411   3.172
    1mno     A2    3   5.155   4.283   3.104
    1nnn     B3    4   5.247   4.318   3.237
    1xax     KA    5   5.306   4.421   3.075
    1ooo     MA    6   5.383   4.347   3.054
    1cbd     NB    7   5.257   4.474   2.941
    1orc     OB1   8   5.189   4.404   2.893
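
To make the column layout concrete, here is a small plain-Python sketch that applies the same slices to the first three sample rows. The names SAMPLE and read_records are hypothetical, and the record count "3" on the second line is an assumption standing in for the n_lines placeholder shown above.

```python
import io

SAMPLE = (
    "Author\n"
    "3\n"  # assumed: the real file stores the number of records here
    "    1xyz      A    1   5.202   4.356   3.155\n"
    "    1mno     A1    2   5.119   4.411   3.172\n"
    "    1mno     A2    3   5.155   4.283   3.104\n"
)


def read_records(stream):
    """Return a list of (int, str, str, int) tuples from a fixed-width stream."""
    next(stream)                 # skip the title line
    n_lines = int(next(stream))  # number of records that follow
    records = []
    for i, line in enumerate(stream):
        if i == n_lines:
            break
        records.append(
            (int(line[0:5]), line[5:10].strip(), line[10:15].strip(), int(line[15:20]))
        )
    return records


records = read_records(io.StringIO(SAMPLE))
for rec in records:
    print(rec)
```

Each record comes out as a tuple such as (1, 'xyz', 'A', 1); the three floating-point columns after character 20 are not read, as in the code above.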

The current implementation takes 76 ms on average on my machine, and using the custom function above makes it slower.
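
For anyone who wants to reproduce a comparable measurement, a timing harness along these lines could be used. It is a sketch only: parse_line is a hypothetical stand-in for the per-line work in the loop, the input is synthetic (one sample row repeated 100,000 times), and the numbers it prints are not comparable to the 76 ms figure, which was measured on the real file with the Cython build.

```python
import timeit


def parse_line(line):
    """Stand-in for the per-line work done inside the loop."""
    return int(line[0:5]), line[5:10].strip(), line[10:15].strip(), int(line[15:20])


# Synthetic input: one sample row repeated (an assumption for illustration).
lines = ["    1xyz      A    1   5.202   4.356   3.155\n"] * 100_000

# Run the whole pass five times, one execution per measurement.
timings = timeit.repeat(lambda: [parse_line(l) for l in lines], number=1, repeat=5)

mean_ms = sum(timings) / len(timings) * 1000
best_ms = min(timings) * 1000
print(f"mean: {mean_ms:.1f} ms, best: {best_ms:.1f} ms")
```

Reporting the minimum alongside the mean helps separate the cost of the code from noise caused by other processes.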

I would appreciate any suggestions. I am new to Cython.

Improve performance of Cython function for faster file reading

0 answers: