I currently have an R-based algorithm that sorts a data.table by date and then finds the most recent non-NA/non-null value. I have had some success adapting the backfill approach from the following StackOverflow question to some relatively large datasets:
Computing the first non-missing value from each column in a DataFrame
I have implemented a solution in both Python and R, but my Python solution seems to run much slower.
library(data.table)
library(microbenchmark)
test_values <- rnorm(100000)
test_values[sample(1:length(test_values), size = 10000)] <- NA
test_values_2 <- rnorm(100000)
test_values_2[sample(1:length(test_values_2), size = 10000)] <- NA
test_ids <- rpois(100000, lambda = 100)
random_timestamp <- sample(x = seq(as.Date('2000-01-01'), as.Date('2017-01-01'), by = 1), size = 100000, replace = TRUE)
dt <- data.table(
  'id' = test_ids,
  'date' = random_timestamp,
  'v1' = test_values,
  'v2' = test_values_2
)
# Simple functions for backfilling
backfillFunction <- function(vector) {
  # find the vector class
  colClass <- class(vector)
  if (all(is.na(vector))) {
    # return an NA of the same class as the vector
    NA_val <- NA
    class(NA_val) <- colClass
    return(NA_val)
  } else {
    # return the first non-NA value
    return(vector[min(which(!is.na(vector)))])
  }
}
print(microbenchmark(
  dt[order(-random_timestamp), lapply(.SD, backfillFunction), by = 'id', .SDcols = c('v1', 'v2')]
))
Unit: milliseconds
expr min lq
dt[order(-random_timestamp), c(lapply(.SD, backfillFunction), list(.N)), by = "id", .SDcols = c("v1", "v2")] 9.976708 12.29137
mean median uq max neval
15.4554 14.47858 16.75997 112.9467 100
The Python solution:
import timeit
setup_statement = """
import numpy as np
import pandas as pd
import datetime

start_date = datetime.datetime(2000, 1, 1)
end_date = datetime.datetime(2017, 1, 1)
step = datetime.timedelta(days=1)
current_date = start_date
dates = []
while current_date < end_date:
    dates.append(current_date)
    current_date += step

date_vect = np.random.choice(dates, size=100000, replace=True)
test_values = np.random.normal(size=100000)
test_values_2 = np.random.normal(size=100000)
na_loc = [np.random.randint(0, 100000, size=10000)]
na_loc_2 = [np.random.randint(0, 100000, size=10000)]
id_vector = np.random.poisson(100, size=100000)
for i in na_loc:
    test_values[i] = None
for i in na_loc_2:
    test_values_2[i] = None

DT = pd.DataFrame(
    data={
        'id': id_vector,
        'date': date_vect,
        'v1': test_values,
        'v2': test_values_2
    }
)
GT = DT.sort_values(['id', 'date'], ascending=[1, 0]).groupby('id')
"""
print(timeit.timeit("{col: GT[col].apply(lambda series: series[series.first_valid_index()] if series.first_valid_index() is not None else None) for col in DT.columns}", number=100, setup=setup_statement)*1000/100)
66.5085821699904
My average time in Python is 67 ms, but in R it is only 15 ms, even though the approaches look relatively similar (apply a function to each column within each group). Why is my R code so much faster than my Python code, and how can I achieve comparable performance in Python?
Answer 0 (score: 2):
Edit: adding another, clearer answer. Define a function that returns the first non-missing value, or NaN if all of the values are missing.
def find_first(s):
    s = s.dropna()
    if len(s) == 0:
        return np.nan
    return s.iloc[0]
GT = DT.sort_values(['id', 'date'], ascending=[True, False])
GT.groupby(['id']).agg(find_first).reset_index()
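To compare this against the 67 ms figure from the question, the same timeit harness can be reused. This is a minimal sketch, assuming setup_statement (the setup string defined in the question) is still in scope; actual timings will vary by machine:
import timeit

# Append find_first to the question's setup string so timeit can see it.
bench_setup = setup_statement + """
def find_first(s):
    s = s.dropna()
    if len(s) == 0:
        return np.nan
    return s.iloc[0]
"""

stmt = "DT.sort_values(['id', 'date'], ascending=[True, False]).groupby('id').agg(find_first)"
print(timeit.timeit(stmt, setup=bench_setup, number=100) * 1000 / 100)  # mean ms per run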
The same can also be accomplished with:
GT.set_index('id').stack().groupby(level=[0,1]).first().unstack()
Stacking the values automatically drops the missing ones and puts everything into a single column; you can then take the first row within each group. There are a lot of steps here, but most of them are just reshaping to make the result look right.
(DT.sort_values(['id', 'date'], ascending=[True, False])
   .set_index(['date', 'id'])
   .stack()                          # long format; NaNs are dropped automatically
   .reset_index()
   .groupby(['id', 'level_2'])      # level_2 holds the original column name
   .first()                         # first (most recent) value per id/column
   .set_index('date', append=True)
   .squeeze()
   .unstack('level_2')              # back to one column per variable
   .reset_index()
   .rename_axis(None, axis='columns'))
Output:
id date v1 v2
0 53 2015-08-29 NaN 1.700798
1 59 2000-04-25 -0.560505 0.371487
2 60 2011-01-07 NaN 0.627205
3 61 2001-03-13 NaN 0.245077
4 61 2011-01-11 0.992256 NaN
5 62 2005-04-14 -0.541771 -1.559377
6 63 2016-03-25 0.338544 0.176700
7 64 2016-07-12 -0.297969 -0.977407
8 65 2009-04-24 NaN -0.429607
9 65 2009-05-04 1.829951 NaN
Extra: you can simplify the construction of the DataFrame considerably, like this:
dates = pd.date_range('2000-1-1', '2017-1-1')
date_vect = np.random.choice(dates, size=100000, replace=True)
test_values = np.random.normal(size=100000)
test_values_2 = np.random.normal(size=100000)
na_loc = np.random.randint(0, 100000, size=10000)
na_loc_2 = np.random.randint(0, 100000, size=10000)
id_vector = np.random.poisson(100, size=100000)
test_values[na_loc] = None
test_values_2[na_loc_2] = None
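As an aside beyond the original answer: pandas' built-in GroupBy.first already skips NaN values in each column, so the whole backfill can be done without any Python-level callable, which generally runs much faster than apply or agg with a custom function. A minimal sketch, assuming DT from the setup above is in scope:
# GroupBy.first returns the first non-NaN entry of each column per group,
# matching the backfill semantics above, but it runs in compiled code.
result = (DT.sort_values(['id', 'date'], ascending=[True, False])
            .groupby('id')[['v1', 'v2']]
            .first()
            .reset_index())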