How can I fill a NumPy array from a pandas DataFrame using vectorization or a Cython routine?

Asked: 2019-09-15 01:24:49

Tags: arrays pandas numpy dataframe

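Suppose you have a pandas DataFrame with start, end, and signal columns.

You want to fill a large NumPy array by adding each row's signal to the slice of the array that runs from start to end.

I have implemented this with both apply and a list comprehension.

According to the linked question "How to iterate over rows in a DataFrame in Pandas?", a list comprehension is faster than apply.

But how can this be vectorized, or how would I write a Cython routine for it?

Any ideas?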

import numpy as np
import pandas as pd
import time
import random


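# Used by the list comprehension: add one row's signal to array[start:end].
def f(start,end,signal,chrBasedSignalArray):
    chrBasedSignalArray[start:end]+=signal

# Used by DataFrame.apply: same update, reading the values from a row.
def updateChrBasedSignalArray(data_row,chrBasedSignalArray):
    chrBasedSignalArray[data_row['start']:data_row['end']] += data_row['signal']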

numberofRows=1000000
startList = random.sample(range(1, 240000000), numberofRows)
endList = [x+100 for x in startList]
signalList = [random.randrange(0,10) for i in range(numberofRows)]

df = pd.DataFrame({'chrom': ['chr1'] * numberofRows, 'start': startList, 'end':endList, 'signal':signalList})

print('##################################')
chrBasedSignalArray = np.zeros(240000000, dtype=np.float32)
print('Before np.sum(chrBasedSignalArray): %f' %np.sum(chrBasedSignalArray))
start_time = time.time()
[f(start,end,signal,chrBasedSignalArray) for start,end,signal in zip(df['start'],df['end'],df['signal'])]
print("--- %s seconds using list comprehension---" % ((time.time() - start_time)))
print('After np.sum(chrBasedSignalArray): %f' %np.sum(chrBasedSignalArray))
print('##################################')

print('##################################')
chrBasedSignalArray = np.zeros(240000000, dtype=np.float32)
print('Before np.sum(chrBasedSignalArray): %f' %np.sum(chrBasedSignalArray))
start_time = time.time()
df.apply(updateChrBasedSignalArray, chrBasedSignalArray=chrBasedSignalArray, axis=1)
print("--- %s seconds using apply---" % ((time.time() - start_time)))
print('After np.sum(chrBasedSignalArray): %f' %np.sum(chrBasedSignalArray))
print('##################################')
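
One possible way to vectorize the update above is a difference array followed by a cumulative sum: record +signal at every start and -signal at every end, then take the running sum. This is only a minimal sketch, not a benchmarked solution. The function name fill_signal_vectorized and its length parameter are made up for illustration, and the sketch assumes every row satisfies 0 <= start <= end <= length.

```python
import numpy as np
import pandas as pd


def fill_signal_vectorized(df, length):
    """Add each row's signal to array[start:end] without a Python-level loop.

    Hypothetical helper; assumes 0 <= start <= end <= length for every row.
    """
    starts = df['start'].to_numpy()
    ends = df['end'].to_numpy()
    signals = df['signal'].to_numpy(dtype=np.float64)

    # Difference array: +signal where an interval opens, -signal where it closes.
    # np.bincount sums the weights of repeated positions, so duplicate
    # start or end values are handled correctly.
    diff = np.bincount(starts, weights=signals, minlength=length + 1)
    diff -= np.bincount(ends, weights=signals, minlength=length + 1)

    # The running sum turns per-position deltas into the per-position signal.
    return np.cumsum(diff[:length]).astype(np.float32)


# Small example with overlapping intervals.
demo = pd.DataFrame({'start': [1, 3, 3], 'end': [4, 6, 5], 'signal': [2, 1, 5]})
result = fill_signal_vectorized(demo, length=8)
print(result)  # values: 0, 2, 2, 8, 6, 1, 0, 0
```

The intermediate arrays are float64 with length + 1 elements, so for 240000000 positions each one needs roughly 1.9 GB of memory. The accumulation is done in float64 and only cast to float32 at the end, which keeps integer-valued signals like the ones in the question exact.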

0 answers:

No answers