Question

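I have a Pandas DataFrame that stores binary data in a column with about 360,000 entries. I am looking for a more efficient way to find the transitions from 0 -> 1 and from 1 -> 0.

Currently I loop over the column and check the condition at every index. That is probably easy to read, but because the function is called many times it is a real bottleneck in a larger script. The last index is not checked, but that is not critical.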

startedge, endedge = [], []
for i in range(len(df.Binarywindow) - 1):
    if df.Binarywindow[i] == 0 and df.Binarywindow[i+1] == 1:
        startedge.append(i)  # 0 -> 1 transition
    elif df.Binarywindow[i] == 1 and df.Binarywindow[i+1] == 0:
        endedge.append(i)    # 1 -> 0 transition

Could you help me rewrite it?

Answer 1

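The approach you describe is indeed very slow for large data sets. The loop indexes into the pandas column one element at a time and extends the result by a single entry via append() on every hit, and it does so across all ~360,000 rows. You can speed this up significantly by converting the column to a NumPy array and searching for the edges in a single vectorized operation. I wrote a minimal example that demonstrates this on a random binary data set.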

import time

import numpy as np
import pandas as pd

# search_sequence_numpy (listed further down) must be defined before this runs.

binaries = np.random.randint(0, 2, 200000)
Binary = pd.DataFrame(binaries)

# Loop based method, collecting the indices in plain lists
t1 = time.time()

startedge, endedge = [], []
for i in range(0, len(Binary) - 1):
    if Binary[0][i] == 0 and Binary[0][i+1] == 1:
        startedge.append(i)
    elif Binary[0][i] == 1 and Binary[0][i+1] == 0:
        endedge.append(i)

t2 = time.time()
print(f"Looping through took {t2-t1} seconds")

# Numpy based method, including conversion of the dataframe
t1 = time.time()
binary_array = np.array(Binary[0])

# Note: the result holds every index covered by a match (both i and i+1 per transition)
startedges = search_sequence_numpy(binary_array, np.array([0, 1]))
stopedges = search_sequence_numpy(binary_array, np.array([1, 0]))

t2 = time.time()
print(f"Converting to a numpy array and looping through required {t2-t1} seconds")

Output (timings as reported by the answer's author; they depend on hardware and library versions):

Looping through took 56.22933220863342 seconds
Converting to a numpy array and looping through required 0.029932022094726562 seconds
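For comparison, the transitions can also be located without leaving pandas, by comparing the column with a copy of itself shifted by one row. This is only a minimal sketch: the toy DataFrame below is hypothetical stand-in data, and the column name Binarywindow is taken from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's df.Binarywindow column.
df = pd.DataFrame({"Binarywindow": [0, 0, 1, 1, 0, 1, 0, 0]})
s = df["Binarywindow"]

# nxt[i] holds the value at position i+1; the last entry becomes NaN,
# so the final index never matches (as in the original loop).
nxt = s.shift(-1)

startedge = s.index[(s == 0) & (nxt == 1)].tolist()  # 0 -> 1 transitions
endedge = s.index[(s == 1) & (nxt == 0)].tolist()    # 1 -> 0 transitions

print(startedge)
print(endedge)
```

Like the loop in the question, this records the index i of the first element of each transition pair.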

For the sequence search function I used the code from the answer to "Searching a sequence in a NumPy array":

import numpy as np

def search_sequence_numpy(arr, seq):
    """ Find sequence in an array using NumPy only.

    Parameters
    ----------
    arr    : input 1D array
    seq    : input 1D array

    Output
    ------
    Output : 1D Array of indices in the input array that satisfy the
    matching of input sequence in the input array.
    In case of no match, an empty list is returned.
    """

    # Store sizes of input array and sequence
    Na, Nseq = arr.size, seq.size

    # Range of sequence
    r_seq = np.arange(Nseq)

    # Create a 2D array of sliding indices across the entire length of input array.
    # Match up with the input sequence & get the matching starting indices.
    M = (arr[np.arange(Na-Nseq+1)[:, None] + r_seq] == seq).all(1)

    # Get the range of those indices as final output
    if M.any():
        return np.where(np.convolve(M, np.ones((Nseq), dtype=int)) > 0)[0]
    else:
        return []         # No match found
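Note that the function above returns every index covered by a match, so each transition contributes both i and i+1. If only the first index of each transition is wanted, as in the question's loop, a difference-based approach is one possible alternative. The following is a sketch under that assumption, not the method from the linked answer; find_edges is a hypothetical helper name and the array is toy data.

```python
import numpy as np

def find_edges(values):
    """Return (start, end) index arrays for 0->1 and 1->0 transitions."""
    # Cast to a signed type so that a 1 -> 0 step yields -1 instead of wrapping around.
    arr = np.asarray(values).astype(np.int8)
    step = np.diff(arr)                 # step[i] = arr[i+1] - arr[i]
    start = np.flatnonzero(step == 1)   # indices i where arr[i] == 0 and arr[i+1] == 1
    end = np.flatnonzero(step == -1)    # indices i where arr[i] == 1 and arr[i+1] == 0
    return start, end

# Toy data for illustration only.
binary = np.array([0, 0, 1, 1, 0, 1, 0, 0])
start, end = find_edges(binary)

# Reference result from the plain loop used in the question.
ref_start, ref_end = [], []
for i in range(len(binary) - 1):
    if binary[i] == 0 and binary[i + 1] == 1:
        ref_start.append(i)
    elif binary[i] == 1 and binary[i + 1] == 0:
        ref_end.append(i)

print(start.tolist() == ref_start, end.tolist() == ref_end)
```

This relies on the column containing only 0 and 1; other values would need an explicit comparison of arr[:-1] and arr[1:] instead.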

Optimizing edge search
