Question

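I have a 1-D np array with more than 150 million data points, populated with np.fromfile from a binary data file.

Given that array, I need to add a value 'val' to every point unless that point equals 'x'.

In addition, each value in the array (depending on its value) corresponds to another value that I want to store in another list.

Variable descriptions:

** temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')

** slr is a list; index 0 in temps corresponds to index 0 in slr, and so on. Both lists have the same length.

Here is my current code: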

import sys
import numpy as np

with open("file.dat", "rb") as f:
    array = np.fromfile(f, dtype=np.float32)
f.close()

#This is the process below that I need to speed up 

T_SLR = np.array(np.zeros(len(array), dtype='Float64'))
for i in range(0,len(array)):
    if array[i] != float(-9.99e+08):
        array[i] = array[i] - 273.15     
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR = slr[index]
    else:
        T_SLR[i] = 0.00

Answer 1

The slowest part of the code is the O(n) traversal of the list:

if array[i] in temps:
    index, = np.where(temps==array[i])[0]

Since temps is not large, you can convert it to a dict:

temps2 = dict(zip(temps, range(len(temps))))

and make the lookup O(1):

if array[i] in temps2:
    index = temps2[array[i]]
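A minimal sketch of the dict lookup, for illustration only. The helper name `find_index` and the -1 "not found" value are assumptions, not part of the question. Note that a dict only matches keys that are bit-for-bit equal, so a value that differs from a grid temperature by rounding error will not be found.

```python
import numpy as np

temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')

# Build the value -> position table once; each later lookup is O(1).
temps2 = dict(zip(temps, range(len(temps))))

def find_index(value):
    """Return the position of value in temps, or -1 if it is absent."""
    return temps2.get(value, -1)

print(find_index(temps[5]))            # a value taken from temps itself
print(find_index(np.float32(12.34)))   # a value outside the grid
```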

You can also try to avoid the for loop to speed things up. For example, the following code:

for i in range(0,len(array)):
    if array[i] != float(-9.99e+08):
        array[i] = array[i] - 273.15

can be written as:

array[array!=float(-9.99e+08)] -= 273.15

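Another problem in the code is the floating-point comparison. You should not use the exact-equality operators == or != on floats; try numpy.isclose with a tolerance, or convert the floats to ints by multiplying by 100.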
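Both suggestions can be sketched as follows. The helper `to_hundredths` and the sample readings are assumptions made up for the example. The widening to float64 before scaling is a choice made here, not something the answer prescribes.

```python
import numpy as np

def to_hundredths(values):
    """Convert temperatures to integer hundredths of a degree."""
    scaled = np.asarray(values, dtype=np.float64) * 100
    return np.round(scaled).astype(np.int64)

# Hypothetical readings in kelvin, stored as float32 like the file data.
kelvin = np.array([263.15, 243.15, 273.15], dtype=np.float32)
celsius = kelvin - np.float32(273.15)

# Tolerant comparison instead of ==.
near_minus_ten = np.isclose(celsius, -10.0, atol=1e-3)

# Integer keys compare exactly, so they are safe to use with == or a dict.
keys = to_hundredths(celsius)
print(near_minus_ten, keys)
```

With integer keys, the position in temps is simply `keys + 3000`, provided the key lies between -3000 and 0.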

Answer 2

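Since your selection criterion appears to be pointwise, you don't need to read all 150 million points at once. You can use the count argument of np.fromfile to limit the size of the array processed at a time. Once you are handling blocks larger than a few thousand points, the overhead of the for loop no longer matters, and you won't strain your memory with one huge array holding all 150 million points.

slr and temps look like an index translation table. You can replace the search over temps with a float comparison and a computed lookup. Since -9.99e+8 is clearly outside the search range, you don't need any special handling for those points.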

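f = open("file.dat", "rb")
N = 10000
# size_of_TMPprs is defined elsewhere; presumably the file size in bytes
# (4 bytes per float32 value)
T_SLR = np.zeros(size_of_TMPprs // 4, dtype=np.float64)
slr = np.asarray(slr)  # array indexing below needs an array, not a list
t_off = 0
array = np.fromfile(f, count=N, dtype=np.float32)
while array.size > 0:
    array -= 273.15
    index = np.where((array >= -30) & (array <= 0))[0]
    # indices must be integers, so cast the rounded values
    T_SLR[t_off + index] = slr[np.round((array[index] + 30) * 100).astype(int)]
    t_off += array.size
    array = np.fromfile(f, count=N, dtype=np.float32)
f.close()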
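For reference, here is a self-contained sketch of the same chunked approach that can be run against a small temporary file. The function name `lookup_in_chunks`, the stand-in `slr` table and the sample values are assumptions for the example. It also filters on the computed index rather than on the temperature range, which is a small deviation from the loop above: a reading that lands a hair below -30 because of float32 rounding is then still kept.

```python
import os
import tempfile
import numpy as np

def lookup_in_chunks(path, slr, chunk=10_000):
    """Read float32 kelvin values block by block and map them through slr."""
    slr = np.asarray(slr, dtype=np.float64)
    out = np.zeros(os.path.getsize(path) // 4, dtype=np.float64)
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = np.fromfile(f, count=chunk, dtype=np.float32)
            if block.size == 0:
                break
            celsius = block.astype(np.float64) - 273.15
            idx = np.round((celsius + 30.0) * 100).astype(np.int64)
            ok = (idx >= 0) & (idx < slr.size)   # drops sentinels and warm values
            out[offset:offset + block.size][ok] = slr[idx[ok]]
            offset += block.size
    return out

# Stand-in table: slr[i] = i + 100, so the results are easy to check by eye.
slr = np.arange(3001, dtype=np.float64) + 100
sample = np.array([243.15, 263.15, -9.99e8, 273.15, 300.0], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.dat")
    sample.tofile(path)
    result = lookup_in_chunks(path, slr, chunk=2)

print(result)
```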

If you want T_SLR to contain the last entry in slr whenever the measured value is above zero, this can be simplified further. In that case you can use

array = np.maximum(np.minimum(array, 0), -30)

to limit the range of values in array, and then use it to compute the index into slr as above (without using where in this case).
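A short sketch of the clamped variant, with made-up Celsius values and a stand-in `slr` table (both assumptions). One side effect worth keeping in mind: values below -30, including the -9.99e+08 sentinel, are clamped to -30 and therefore pick up the first entry of slr instead of 0.

```python
import numpy as np

# Stand-in table: slr[i] = i + 100.
slr = np.arange(3001, dtype=np.float64) + 100

celsius = np.array([-45.0, -30.0, -12.34, 0.0, 7.5])

# Clamp to the table's range; np.clip(celsius, -30, 0) gives the same result.
clipped = np.maximum(np.minimum(celsius, 0), -30)

# Every value now maps to a valid index, so no mask is needed.
idx = np.round((clipped + 30) * 100).astype(np.int64)
values = slr[idx]
print(idx, values)
```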

Answer 3

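When using with open, don't close the file yourself. The with context does that automatically. I also changed the generic name array to something with less risk of shadowing something else (e.g. np.array?)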

with open("file.dat", "rb") as f:
    data = np.fromfile(f, dtype=np.float32)

First, there's no need to wrap np.zeros in np.array. It already is an array. len(data) is fine if data is 1d, but I prefer to use the shape tuple.

T_SLR = np.zeros(data.shape, dtype='float64')

Boolean indexing/masking lets you act on the whole array at once:

mask = data != -9.99e8   # don't need `float` here
                         # using != test with floats is poor idea
data[mask] -= 273.15

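That != test needs refining. It is fine for integers, but not for floats. Something like np.abs(data+9.99e8)>1 is better.

Similarly, in is not a good test with floats. And with integers, the in and the where do redundant work.

Assuming temps is 1d, np.where(...) returns a 1-element tuple. [0] selects that element, returning an array. The comma in index, is redundant. index, = np.where() without the [0] should work.

Given how the array was initialized, T_SLR[i] is already 0. There is no need to set it again.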
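As a small illustration of the tolerant sentinel test, using made-up sample values (the name `MISSING` is an assumption for the example):

```python
import numpy as np

MISSING = -9.99e8  # sentinel marking "no data" in the file

data = np.array([250.0, MISSING, 273.15], dtype=np.float32)

# True wherever the value is more than 1 away from the sentinel.
valid = np.abs(data - MISSING) > 1

# Convert only the valid readings from kelvin to Celsius, in place.
data[valid] -= 273.15
print(valid, data)
```

The threshold of 1 is generous on purpose: real temperatures are hundreds of millions away from the sentinel, so any small tolerance would do.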

for i in range(0,len(array)):
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR = slr[index]
    else:
        T_SLR[i] = 0.00

But I think we can get rid of this iteration as well. I'll leave that discussion for later.

In [461]: temps=np.arange(-30.00,0.01,0.01, dtype='float32')
In [462]: temps
Out[462]: 
array([ -3.00000000e+01,  -2.99899998e+01,  -2.99799995e+01, ...,
        -1.93138123e-02,  -9.31358337e-03,   6.86645508e-04], dtype=float32)
In [463]: temps.shape
Out[463]: (3001,)

No wonder array[i] in temps and np.where(temps==array[i]) are slow.

We can drop the in test by looking at what where returns:

In [464]: np.where(temps==12.34)
Out[464]: (array([], dtype=int32),)
In [465]: np.where(temps==temps[3])
Out[465]: (array([3], dtype=int32),)

If there is no match, where returns an empty array.

In [466]: idx,=np.where(temps==temps[3])
In [467]: idx.shape
Out[467]: (1,)
In [468]: idx,=np.where(temps==123.34)
In [469]: idx.shape
Out[469]: (0,)

in may be faster than where if the match is early in the list, but it is just as slow, if not slower, when the match is at the end or there is no match.

In [478]: timeit np.where(temps==temps[-1])[0].shape[0]>0
10000 loops, best of 3: 35.6 µs per loop
In [479]: timeit temps[-1] in temps
10000 loops, best of 3: 39.9 µs per loop

A rounding approach:

In [487]: (np.round(temps,2)/.01).astype(int)
Out[487]: array([-3000, -2999, -2998, ...,    -2,    -1,     0])

I'd suggest tweaking:

T_SLR = -(np.round(data, 2)/.01).astype(int)
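One possible way to turn the rounding idea into a full lookup is sketched below. The function `slr_lookup`, its parameters and the tolerance `tol` are assumptions, not the answerer's code. Because astype(int) truncates toward zero, the sketch rounds with np.rint before casting. It only counts a value as a hit when it sits close to a grid point, which mimics the original `in temps` test.

```python
import numpy as np

def slr_lookup(celsius, slr, t_min=-30.0, step=0.01, tol=0.1):
    """Map Celsius values lying on the 0.01-degree grid to slr, else 0."""
    celsius = np.asarray(celsius, dtype=np.float64)
    slr = np.asarray(slr, dtype=np.float64)
    pos = (celsius - t_min) / step        # fractional position on the grid
    idx = np.rint(pos).astype(np.int64)   # nearest grid point
    hit = (idx >= 0) & (idx < slr.size) & (np.abs(pos - idx) <= tol)
    out = np.zeros(celsius.shape, dtype=np.float64)
    out[hit] = slr[idx[hit]]
    return out

# Stand-in table: slr[i] = i + 100.
slr = np.arange(3001, dtype=np.float64) + 100
result = slr_lookup([-30.0, -29.99, -12.345, 0.0, 5.0], slr)
print(result)
```

Here -12.345 falls halfway between two grid points and 5.0 lies outside the table, so both map to 0.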

Answer 4

Since temps is sorted, you can use np.searchsorted and avoid all explicit loops:

array[array != float(-9.99e+08)] -= 273.15
indices = np.searchsorted(temps, array)
# Remove indices out of bounds
mask = indices < temps.shape[0]
# Remove in-bounds indices not matching exactly
mask[mask] &= temps[indices[mask]] == array[mask]
# Fill matches from slr; everything else stays 0
T_SLR = np.zeros(array.shape, dtype=np.float64)
T_SLR[mask] = np.asarray(slr)[indices[mask]]
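Exact matches can be fragile after subtracting 273.15 in float32, so a tolerant variant may be worth considering. The sketch below is one way to do it, not the answerer's method: `match_sorted`, `atol` and the stand-in grid and table are assumptions. It picks the nearest grid neighbour found by np.searchsorted and accepts it only if it is within `atol`. It assumes the grid has at least two entries.

```python
import numpy as np

def match_sorted(values, grid, table, atol=1e-3):
    """Look up table entries for values within atol of a sorted grid point."""
    values = np.asarray(values, dtype=np.float64)
    grid = np.asarray(grid, dtype=np.float64)
    table = np.asarray(table, dtype=np.float64)
    # Insertion points, clamped so that both neighbours exist.
    right = np.clip(np.searchsorted(grid, values), 1, grid.size - 1)
    left = right - 1
    # Choose whichever neighbour is closer to the value.
    nearest = np.where(values - grid[left] <= grid[right] - values, left, right)
    hit = np.abs(grid[nearest] - values) <= atol
    out = np.zeros(values.shape, dtype=np.float64)
    out[hit] = table[nearest[hit]]
    return out

grid = np.arange(-3000, 1) / 100                # -30.00 ... 0.00 in 0.01 steps
table = np.arange(3001, dtype=np.float64) + 100  # stand-in for slr
result = match_sorted([-30.0, -29.99, -12.345, 0.0, 5.0, -9.99e8], grid, table)
print(result)
```

Unlike the computed-index approach, this also works when the grid is sorted but not evenly spaced.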

Python: fast traversal of an np.array

4 answers: