Question

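I am running this code on a large CSV file (1.5 million rows). Is there a way to optimize it?

df is a pandas DataFrame. For each row, I want to know which of the following happens first within the next 1000 rows:

the value rises above my value + 0.0004, or it falls to my value - 0.0004.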

result = []
for row in range(len(df) - 1000):
    # df.get_value was removed in pandas 1.0; df.at is the scalar accessor
    start = df.at[row, 'A']
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):
        ref = df.at[row + n, 'B']
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n == 999:
            result.append(0)

The DataFrame looks like this:

         timestamp           A         B
0   20190401 00:00:00.127  1.12230  1.12236
1   20190401 00:00:00.395  1.12230  1.12237
2   20190401 00:00:00.533  1.12229  1.12234
3   20190401 00:00:00.631  1.12228  1.12233
4   20190401 00:00:01.019  1.12230  1.12234
5   20190401 00:00:01.169  1.12231  1.12236

The result is: result = [0, 0, 1, 0, 0, 1, -1, 1, ...]

This works, but it takes a very long time on such a large file.

Answer 1

To generate the value for the "first outlier", define the following function:

import numpy as np

def firstOutlier(row, dltRow = 4, dltVal = 0.1):
    ''' Find the value for the first "outlier". Parameters:
    row    - the current row
    dltRow - number of rows to check, starting from the current
    dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                        # Index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]  # "dltRow" rows from the current
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:  # No outliers within the range of rows
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))

Then apply it to each row:

df.apply(firstOutlier, axis=1)

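This function relies on the DataFrame having an index of consecutive integers starting from 0. Given ind, the index of any row, we can then access that row with df.iloc[ind], and a slice of n rows starting at that row with df.iloc[ind : ind + n].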
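If your frame does not already have such an index (for example after filtering rows), one way to get it is reset_index(drop=True). A minimal sketch, using a made-up four-row frame that is not from the question:

```python
import pandas as pd

# Hypothetical frame whose index is NOT 0..n-1 (e.g. left over from filtering)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [1.5, 2.5, 3.5, 4.5]},
                  index=[10, 20, 30, 40])

# Rebuild a consecutive 0-based index so that index label == row position
df = df.reset_index(drop=True)

ind, n = 1, 2
single_row = df.iloc[ind]          # the row at position 1
window = df.iloc[ind : ind + n]    # n rows starting at position 1
```

After the reset, row.name inside the applied function is the same number as the row's position, which is what the iloc slice assumes.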

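For testing, I set the default parameter values to:

dltRow = 4 - look at 4 rows, starting from the current one,
dltVal = 0.1 - look for rows where column B is 0.1 or more "away" from A in the current row.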

My test DataFrame is:

      A     B
0  1.00  1.00
1  0.99  1.00
2  1.00  0.80
3  1.00  1.05
4  1.00  1.20
5  1.00  1.00
6  1.00  0.80
7  1.00  1.00
8  1.00  1.00

The result (for my data and the default parameter values) is:

0   -1
1   -1
2   -1
3    1
4    1
5   -1
6   -1
7    0
8    0
dtype: int64

For your case, change the default parameter values to 1000 and 0.0004, respectively.
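Instead of editing the defaults, the parameters can also be passed at call time, since DataFrame.apply forwards extra keyword arguments to the function. The sketch below uses a hypothetical variant, first_outlier, that takes the frame as an explicit argument (the names first_outlier, frame, dlt_row and dlt_val are mine, not part of the function above) and runs it on the test frame from this answer:

```python
import numpy as np
import pandas as pd

def first_outlier(row, frame, dlt_row=4, dlt_val=0.1):
    """Hypothetical variant of firstOutlier that receives the frame explicitly."""
    pos = row.name                             # assumes a 0-based consecutive index
    window = frame.iloc[pos : pos + dlt_row]   # dlt_row rows from the current one
    diff = window["B"] - row["A"]
    hits = diff[diff.abs() >= dlt_val]         # rows at least dlt_val away from A
    if hits.empty:
        return 0
    return int(np.sign(hits.iloc[0]))

# The test frame shown earlier in this answer
df = pd.DataFrame({
    "A": [1.00, 0.99, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    "B": [1.00, 1.00, 0.80, 1.05, 1.20, 1.00, 0.80, 1.00, 1.00],
})

# Defaults (4 rows, 0.1), then the values from the question (1000 rows, 0.0004)
small = df.apply(first_outlier, axis=1, frame=df)
wide = df.apply(first_outlier, axis=1, frame=df, dlt_row=1000, dlt_val=0.0004)
```

Note that this still builds one slice per row, so on 1.5 million rows it may well remain slow. It also treats both boundaries as inclusive (>=), which is slightly different from the > / <= comparisons in the question.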

Answer 2

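The idea is to loop through A and B while maintaining a sorted list of A values. Then, for each B, find the highest A for which this B is a win and the lowest A for which it is a loss. Since the list is sorted, each search is O(log(n)). Only those A's whose index lies within the last 1000 rows are used to set the result vector. Afterwards, the A's that are no longer waiting for a B are removed from the sorted list to keep it small.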
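To make the two searches concrete, here is a small sketch of a single step on a hand-made list. The values, and the names pending, n_win and first_lose, are only for illustration:

```python
import bisect

# Pending A values stored as (value, row_index) tuples, kept sorted by value
pending = []
for idx, a in enumerate([0.50, 0.10, 0.90, 0.30]):
    bisect.insort(pending, (a, idx))
# pending is now [(0.10, 1), (0.30, 3), (0.50, 0), (0.90, 2)]

b, thresh = 0.60, 0.25

# Entries with a < b - thresh: this b is above their "win" level.
# Using -1 as the index in the key keeps the comparison on a strict.
n_win = bisect.bisect_right(pending, (b - thresh, -1))
# Entries with a >= b + thresh: this b is at or below their "lose" level.
first_lose = bisect.bisect_left(pending, (b + thresh, -1))

winners = pending[:n_win]                  # resolved as +1
losers = pending[first_lose:]              # resolved as -1
still_waiting = pending[n_win:first_lose]  # stays in the list
```

Both lookups are binary searches, so finding the two cut points costs O(log(n)) no matter how many A values are pending.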

Here is the code, followed by sample output:

import numpy as np
import bisect
import time

N = 10
M = 3
#N=int(1e6)
#M=int(1e3)
thresh = 0.4

A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)

l = []

t_start = time.time()

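for i in range(N):
    # Add the current A to the sorted list of values still waiting for a result
    a = (A[i],i)
    bisect.insort(l,a)
    b = B[i]
    # Entries from firstLoseInd on have A >= b + thresh (this b is a loss for them)
    firstLoseInd = bisect.bisect_left(l,(b+thresh,-1))
    # Entries before lastWinInd have A < b - thresh (this b is a win for them)
    lastWinInd = bisect.bisect_right(l,(b-thresh,-1))
    for j in range(lastWinInd):
        curInd = l[j][1]
        if curInd > i-M:           # only rows within the last M rows count
            result[curInd] = 1
    for j in range(firstLoseInd,len(l)):
        curInd = l[j][1]
        if curInd > i-M:
            result[curInd] = -1
    # Drop resolved entries; delete the tail first so the head indices stay valid
    del l[firstLoseInd:]
    del l[:lastWinInd]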

t_done = time.time()

print(A)
print(B)
print(result)
print(t_done - t_start)

Sample output (A, B, result):

[ 0.22643589 0.96092354 0.30098532 0.15569044 0.88474775 0.25458535 0.78248271 0.07530432 0.3460113 0.0785128 ]
[ 0.83610433 0.33384085 0.51055061 0.54209458 0.13556121 0.61257179 0.51273686 0.54850825 0.24302884 0.68037965]
[ 1. -1. 0. 1. -1. 0. -1. 1. 0. 1.]

With N = int(1e6), it took about 3.4 seconds on my computer.
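If you want to convince yourself that the sorted-list version gives the same labels as a plain nested loop, a check along these lines should work. This is only a sketch: the function names, the seed and the sizes are arbitrary choices of mine, and the brute-force scan is deliberately written with the same comparisons (A < B - thresh, A >= B + thresh) so that floating-point rounding cannot make the two disagree.

```python
import bisect
import numpy as np

def label_bisect(A, B, M, thresh):
    """The sorted-list approach from this answer, wrapped in a function."""
    result = np.zeros(len(A))
    pending = []                                   # sorted (A value, index) pairs
    for i in range(len(A)):
        bisect.insort(pending, (A[i], i))
        n_win = bisect.bisect_right(pending, (B[i] - thresh, -1))
        first_lose = bisect.bisect_left(pending, (B[i] + thresh, -1))
        for _, k in pending[:n_win]:
            if k > i - M:                          # still inside its M-row window
                result[k] = 1
        for _, k in pending[first_lose:]:
            if k > i - M:
                result[k] = -1
        del pending[first_lose:]                   # tail first, then head
        del pending[:n_win]
    return result

def label_brute(A, B, M, thresh):
    """Direct O(N*M) scan over the next M rows (window truncated at the end)."""
    result = np.zeros(len(A))
    for i in range(len(A)):
        for j in range(i, min(i + M, len(A))):
            if A[i] < B[j] - thresh:
                result[i] = 1
                break
            if A[i] >= B[j] + thresh:
                result[i] = -1
                break
    return result

rng = np.random.default_rng(0)                     # fixed seed, chosen arbitrarily
A = rng.random(2000).tolist()
B = rng.random(2000).tolist()
agree = np.array_equal(label_bisect(A, B, 50, 0.4), label_brute(A, B, 50, 0.4))
```

Here agree is True when both functions produce identical result vectors. Unlike the loop in the question, both versions also label the final rows, using whatever rows remain after them.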

How can I optimize this Python loop?
