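TRUE rows. Because I need to be able to do this 1 to 25 million times on a fairly regular basis, speed actually matters quite a bit.
The most efficient/fastest single-process way I have found to do this is the "how many" Rcpp function (hm2).
My limited profiling abilities suggest that the vast majority of the time is spent in the if(r_ttl == xcols){...} check. I cannot think of a different algorithm that would be faster (I tried breaking out of the loop as soon as a FALSE is found, and it was much slower).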
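For reference, the early-exit variant mentioned above might look roughly like the sketch below. It is plain C++ over a column-major buffer of 0/1 integers (the layout R uses for matrices) rather than Rcpp, and the name count_all_true_early_exit is hypothetical, not taken from the original code. It assumes the buffer contains no NA values. One possible reason such a version can be slower is that the data-dependent break is hard to predict on random input and makes the inner loop harder to unroll; that was not measured here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): counts rows that are
// entirely non-zero in a column-major nrow x ncol buffer, leaving each row
// at the first zero it encounters.
int count_all_true_early_exit(const std::vector<int>& x, int nrow, int ncol) {
    int n_all_true = 0;
    for (int row = 0; row < nrow; ++row) {
        bool all_true = true;
        for (int col = 0; col < ncol; ++col) {
            // Column-major layout: element (row, col) is at col * nrow + row.
            if (!x[static_cast<std::size_t>(col) * nrow + row]) {
                all_true = false;
                break;  // data-dependent exit from the inner loop
            }
        }
        n_all_true += all_true;  // bool converts to 0 or 1
    }
    return n_all_true;
}
```

Calling it on a 3 x 2 buffer {1, 1, 0, 1, 0, 1} (columns [1,1,0] and [1,0,1]) counts one all-true row, the first.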
A matrix like the following can be assumed:
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
head(m)
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
#> [1,] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
#> [2,] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
#> [3,] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
#> [4,]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
#> [5,]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
#> [6,] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
#include <Rcpp.h>
using namespace Rcpp;

// Count the rows of a logical matrix whose entries are all TRUE.
// [[Rcpp::export]]
int hm(const LogicalMatrix& x){
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}
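I do not understand why, but on my machine it is faster if I hard-code the number of columns (it would be great if someone could explain why this is, too):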
// [[Rcpp::export]]
int hm2(const LogicalMatrix& x){
  const int xrows = x.nrow();
  // const int xcols = x.ncol();
  int n_all_true = 0;
  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < 10; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == 10){
      n_all_true += 1;
    }
  }
  return n_all_true;
}
microbenchmark(hm(m), hm2(m), times = 1000)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> hm(m) 597.828 599.0995 683.3482 605.397 643.8655 1659.711 1000
#> hm2(m) 236.847 237.6565 267.8787 238.748 253.5280 683.221 1000
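Answer 0 (score: 4)
Using OpenMP (I now see that the question asks for a single-threaded solution) and minimal code changes, it can still be made about 30% faster, at least on my 4-core Xeon. I have a feeling a logical reduction might do even better, but I will leave that for another day: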
library(Rcpp)
library(microbenchmark)
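# Note: these names are swapped relative to how they are used below:
# m_rows is passed as the number of columns (10) and m_cols * m_rows is the total cell count.
m_rows <- 10L
m_cols <- 50000L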
rebuild = FALSE
cppFunction('int hm(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)
hm3 <- function(m) {
  nc <- ncol(m)
  sum(rowSums(m) == nc)
}
cppFunction('int hm_jmu(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)
macroExpand <- function(NCOL) {
  paste0('int hm_npjc(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  int n_all_true = 0;
  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < ',NCOL,'; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == ',NCOL,'){
      n_all_true++;
    }
  }
  return n_all_true;
}')
}
macroExpand_omp <- function(NCOL) {
  paste0('int hm_npjc_omp(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  int n_all_true = 0;
  #pragma omp parallel for reduction(+:n_all_true)
  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < ',NCOL,'; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == ',NCOL,'){
      n_all_true++;
    }
  }
  return n_all_true;
}')
}
cppFunction(macroExpand(m_rows), rebuild = rebuild)
cppFunction(macroExpand_omp(m_rows), plugins = "openmp", rebuild = rebuild)
cppFunction('int hm_omp(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  #pragma omp parallel for reduction(+:n_all_true) schedule(static)
  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}', plugins = "openmp", rebuild = rebuild)
# using != as inner loop control - no difference, using pre-increment in n_all_true, no diff, static vs dynamic OpenMP, attempted to direct clang and gcc to unroll loops: didn't seem to work
set.seed(21)
m <- matrix(sample(c(TRUE, FALSE), m_cols * m_rows, replace = T), ncol = m_rows)
print(microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m),
hm_omp(m), hm_npjc_omp(m),
times = 1000))
I used GCC 4.9. Results with Clang 3.7 were similar.
This gives:
Unit: microseconds
expr min lq mean median uq max neval
hm(m) 614.074 640.9840 643.24836 641.462 642.9920 976.694 1000
hm3(m) 2705.066 2768.3080 2948.39388 2775.992 2786.8625 43424.060 1000
hm_jmu(m) 591.179 612.3590 625.84484 612.881 613.8825 6874.428 1000
hm_npjc(m) 62.958 63.8965 64.89338 64.346 65.0550 144.487 1000
hm_omp(m) 91.892 92.6050 165.21507 93.758 98.8230 10026.583 1000
hm_npjc_omp(m) 43.129 43.6820 129.15842 44.458 47.0860 17636.875 1000
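The OpenMP magic is just including -fopenmp at compile and link time (handled by Rcpp via plugins = "openmp"), plus
#pragma omp parallel for reduction(+:n_all_true) schedule(static)
In this case the outer loop is parallelized and the result is a sum, so the reduction clause tells the compiler to split the problem up and reduce the per-part sums into a single sum. schedule(static) describes how the compiler and/or runtime distributes loop iterations among threads. Here the widths of both the inner and outer loops are known, so static is preferred; if, say, the inner loop size varied a lot, dynamic might balance the work between threads better.
It is possible to tell OpenMP explicitly how many loop iterations each thread gets, but it is usually best to let the compiler decide.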
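The reduction described above can be sketched outside of Rcpp as follows. This is only an illustration in plain C++: the function name count_all_true_omp and the use of a column-major std::vector<int> of 0/1 values (no NA) are assumptions, not part of the original answer. Each thread keeps a private copy of n_all_true, and the private copies are added together when the loop ends.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts all-true rows of a column-major nrow x ncol
// buffer, with the outer (row) loop shared among OpenMP threads.
int count_all_true_omp(const std::vector<int>& x, int nrow, int ncol) {
    int n_all_true = 0;
    // reduction(+:n_all_true): per-thread partial counts, summed at the end.
    // schedule(static): rows are split into fixed chunks, one per thread.
    #pragma omp parallel for reduction(+:n_all_true) schedule(static)
    for (int row = 0; row < nrow; ++row) {
        int r_ttl = 0;
        for (int col = 0; col < ncol; ++col) {
            r_ttl += x[static_cast<std::size_t>(col) * nrow + row];
        }
        if (r_ttl == ncol) {
            n_all_true++;
        }
    }
    return n_all_true;
}
```

Compiled without -fopenmp the pragma is ignored and the function runs serially with the same result, which makes it easy to check the parallel version against the serial one.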
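On a side note, I tried hard to use compiler flags such as -funroll-loops to replace the ugly but fast hard-coding of the inner loop width (which is not a general solution to the problem). I tested these to no avail: see https://github.com/jackwasey/optimization-comparison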
Answer 1 (score: 3)
Here is your function, along with the output from compiling it via cppFunction:
require(Rcpp)
cppFunction('int hm(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}')
# file.*.cpp: In function ‘int hm(const LogicalMatrix&)’:
# file.*.cpp:12:29: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
# for(size_t row = 0; row < xrows; row++) {
# ^
# file.*.cpp:14:31: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
# for(size_t col = 0; col < xcols; col++) {
# ^
Note the warnings. By using int instead of size_t for row and col, I can get a small improvement. Other than that, I cannot find much room for improvement.
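As a side illustration of what the -Wsign-compare warning is about: when a size_t is compared with an int, the int is first converted to an unsigned type. The hypothetical helpers below (not part of the original answer) make that conversion explicit. For the non-negative dimensions in hm the result is the same either way; the sketch only shows why the compiler flags the mixed comparison, and says nothing about the timing difference reported here.

```cpp
#include <cstddef>

// Hypothetical helper: spells out what `i < n` does when i is size_t and n is int.
bool mixed_less(std::size_t i, int n) {
    // The int operand is converted to an unsigned type before the comparison,
    // so a negative n becomes a very large positive value.
    return i < static_cast<std::size_t>(n);
}

// Same comparison with both operands signed: no conversion takes place.
bool signed_less(int i, int n) {
    return i < n;
}
```

With n = -1, mixed_less(0, n) is true while signed_less(0, n) is false; with n = 10 the two agree.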
Here is my code, the benchmarks, and a reproducible example:
require(Rcpp)
require(microbenchmark)
cppFunction('int hm_jmu(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}')
hm3 <- function(m) {
  nc <- ncol(m)
  sum(rowSums(m) == nc)
}
set.seed(21)
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
microbenchmark(hm(m), hm3(m), hm_jmu(m), times=1000)
# Unit: microseconds
# expr min lq median uq max neval
# hm(m) 578.844 594.1460 607.357 636.4410 858.347 1000
# hm3(m) 6389.014 6452.9595 6476.197 6735.5465 33720.870 1000
# hm_jmu(m) 409.920 415.0395 424.401 449.0075 650.127 1000
Answer 2 (score: 1)
I was very curious why "baking in" the column count as a constant would make a difference, so I played around with the idea.
library(Rcpp)
library(microbenchmark)
cppFunction('int hm(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for(size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(size_t col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == 10){
      n_all_true++;
    }
  }
  return n_all_true;
}')
hm3 <- function(m) {
  nc <- ncol(m)
  sum(rowSums(m) == nc)
}
cppFunction('int hm_jmu(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < xcols; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == xcols){
      n_all_true++;
    }
  }
  return n_all_true;
}')
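I just took Joshua's solution here, but generated a tailor-made function through code generation, which runs well on my machine. This seems rather hacky to me, but I thought I would post it: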
macroExpand <- function(NCOL) {
  paste0('int hm_npjc(const LogicalMatrix& x)
{
  const int xrows = x.nrow();
  int n_all_true = 0;
  for(int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for(int col = 0; col < ',NCOL,'; col++) {
      r_ttl += x(row,col);
    }
    if(r_ttl == ',NCOL,'){
      n_all_true++;
    }
  }
  return n_all_true;
}')
}
cppFunction(macroExpand(10L))
set.seed(21)
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m), times=1000)
#> Unit: microseconds
#>        expr      min        lq      mean    median        uq       max neval
#>       hm(m)  596.808  600.1870  722.5140  629.1750  709.3875  1680.379  1000
#>      hm3(m) 2189.164 2353.6700 2972.1463 2509.4630 2956.7675 49930.471  1000
#>   hm_jmu(m)  574.137  576.5160  678.6475  600.4775  665.2800  2240.988  1000
#>  hm_npjc(m)   81.978   83.1855  102.7646   89.2160  101.0400   380.884  1000
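I would like to note that I really do not understand why the compiler does not optimize to the same solution; if anyone has insight into that, it would be great.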
devtools::session_info()
#> Session info --------------------------------------------------------------
#> setting value
#> version R version 3.2.2 (2015-08-14)
#> system x86_64, darwin13.4.0
#> ui RStudio (0.99.691)
#> language (EN)
#> collate en_CA.UTF-8
#> tz America/Los_Angeles
#> date 2015-09-27
#> Packages ------------------------------------------------------------------
#> package * version date source
#> clipr 0.1.1 2015-09-04 CRAN (R 3.2.0)
#> colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
#> devtools 1.9.1 2015-09-11 CRAN (R 3.2.0)
#> digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
#> evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
#> formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
#> ggplot2 1.0.1 2015-03-17 CRAN (R 3.2.0)
#> gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
#> htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
#> knitr 1.10.5 2015-05-06 CRAN (R 3.2.0)
#> magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
#> MASS 7.3-43 2015-07-16 CRAN (R 3.2.2)
#> memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
#> microbenchmark * 1.4-2 2014-09-28 CRAN (R 3.2.0)
#> munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
#> plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
#> proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
#> Rcpp * 0.12.1 2015-09-10 CRAN (R 3.2.0)
#> reprex 0.0.0.9001 2015-09-26 Github (jennybc/reprex@1d6584a)
#> reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
#> rmarkdown 0.7 2015-06-13 CRAN (R 3.2.0)
#> rstudioapi 0.3.1 2015-04-07 CRAN (R 3.2.0)
#> scales 0.3.0 2015-08-25 CRAN (R 3.2.0)
#> stringi 0.5-5 2015-06-29 CRAN (R 3.2.0)
#> stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
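Answer 3 (score: 0)
How about taking advantage of the fact that TRUE is coerced to 1 by many numeric operators, so that the work is already vectorized in functions that are written in C? For example: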
set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
sum(rowSums(m) == ncol(m))
## [1] 47
microbenchmark::microbenchmark(sum(rowSums(m) == ncol(m)))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum(rowSums(m) == ncol(m)) 1.715399 1.840763 1.873422 1.861552 1.905841 2.02524 100
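See Chapter 3 of The R Inferno.
Edit, for a direct comparison:
(Here I pasted the two C++ functions into a file on my desktop called test.cpp, along with the usual Rcpp header information.)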
require(Rcpp)
sourceCpp("~/Desktop/test.cpp")
set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
hm3 <- function(m) {
  nc <- ncol(m)
  sum(rowSums(m) == nc)
}
microbenchmark::microbenchmark(hm(m), hm2(m), hm3(m), times = 1000)
## Unit: milliseconds
## expr min lq mean median uq max neval
## hm(m) 4.996005 5.036732 5.169672 5.089707 5.194580 9.961581 1000
## hm2(m) 5.031222 5.074990 5.228239 5.128106 5.242909 10.109776 1000
## hm3(m) 1.626933 1.878014 2.205195 1.922608 2.014012 226.894190 1000
I note here that the reference to The R Inferno is not really appropriate, since it does not apply to C++, but it is still a mantra to live by. :-)