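Suppose you have a pandas DataFrame with start, end, and signal columns.
You want to fill a large numpy array by adding each row's signal to every position from start to end.
I have implemented this both with apply and with a list comprehension.
According to this link, How to iterate over rows in a DataFrame in Pandas?, a list comprehension is faster than apply.
But how can I vectorize this, or write a Cython routine for it?
Any ideas?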
import numpy as np
import pandas as pd
import time
import random
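def f(start, end, signal, chrBasedSignalArray):
    # Add the signal to every position in the half-open interval [start, end)
    chrBasedSignalArray[start:end] += signal

def updateChrBasedSignalArray(data_row, chrBasedSignalArray):
    # Same update, taking one DataFrame row at a time (for use with df.apply)
    chrBasedSignalArray[data_row['start']:data_row['end']] += data_row['signal']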
numberofRows=1000000
startList = random.sample(range(1, 240000000), numberofRows)
endList = [x+100 for x in startList]
signalList = [random.randrange(0,10) for i in range(numberofRows)]
df = pd.DataFrame({'chrom': ['chr1'] * numberofRows, 'start': startList, 'end':endList, 'signal':signalList})
print('##################################')
chrBasedSignalArray = np.zeros(240000000, dtype=np.float32)
print('Before np.sum(chrBasedSignalArray): %f' % np.sum(chrBasedSignalArray))
start_time = time.time()
[f(start,end,signal,chrBasedSignalArray) for start,end,signal in zip(df['start'],df['end'],df['signal'])]
print("--- %s seconds using list comprehension---" % ((time.time() - start_time)))
print('After np.sum(chrBasedSignalArray): %f' %np.sum(chrBasedSignalArray))
print('##################################')
print('##################################')
chrBasedSignalArray = np.zeros(240000000, dtype=np.float32)
print('Before np.sum(chrBasedSignalArray): %f' % np.sum(chrBasedSignalArray))
start_time = time.time()
df.apply(updateChrBasedSignalArray, chrBasedSignalArray=chrBasedSignalArray, axis=1)
print("--- %s seconds using apply---" % ((time.time() - start_time)))
print('After np.sum(chrBasedSignalArray): %f' %np.sum(chrBasedSignalArray))
print('##################################')
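
One possible way to vectorize this is sketched below. It is not benchmarked against the code above. Instead of touching every position in each interval, it records `+signal` at each start and `-signal` at each end in a "difference" array, then takes a cumulative sum. The function name `fill_signal_vectorized` and its parameters are hypothetical, chosen for illustration. The sketch assumes non-negative indices and `start <= end` in every row. Ends beyond the array length are clipped, as slicing would do.

```python
import numpy as np

def fill_signal_vectorized(starts, ends, signals, length, dtype=np.float64):
    """Hypothetical helper: add signals[i] to out[starts[i]:ends[i]] for all i
    without a Python-level loop. Assumes 0 <= starts[i] <= ends[i]."""
    # Clip to the array length so out-of-range ends behave like slice clipping
    starts = np.minimum(np.asarray(starts, dtype=np.int64), length)
    ends = np.minimum(np.asarray(ends, dtype=np.int64), length)
    signals = np.asarray(signals, dtype=dtype)

    # diff[k] is the change in the running total when moving to position k;
    # the extra slot at index `length` absorbs intervals ending at the boundary
    diff = np.zeros(length + 1, dtype=dtype)
    # np.add.at accumulates correctly when the same index occurs more than once
    # (plain fancy-index assignment `diff[starts] += signals` would not)
    np.add.at(diff, starts, signals)
    np.add.at(diff, ends, -signals)

    # The running sum of the changes gives the total signal at each position
    return np.cumsum(diff[:-1])

# Small check against the slice-based loop from the question
starts = [0, 2, 2, 5]
ends = [3, 6, 4, 8]
signals = [1, 2, 3, 4]
result = fill_signal_vectorized(starts, ends, signals, length=8)

expected = np.zeros(8)
for s, e, v in zip(starts, ends, signals):
    expected[s:e] += v
print(np.array_equal(result, expected))
```

With the DataFrame from the question, the call would look like `fill_signal_vectorized(df['start'].to_numpy(), df['end'].to_numpy(), df['signal'].to_numpy(), 240000000)`. Note the memory cost: a float64 difference array of that length takes roughly 1.9 GB, and `np.cumsum` allocates a second array of the same size. Passing `dtype=np.float32` halves this, but accumulating a long running sum in float32 can lose precision.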