Converting a for loop with t(apply()) from R to C++ for use with Rcpp

Time: 2019-06-10 18:27:32

Tags: apply rcpp transpose

I am trying to convert a for loop written in R to C++ for use with Rcpp; specifically, an 'apply'-type call wrapped in a transpose.

The function takes a .gen file and converts it into alleles.

I have read Nick Ulle's Rcpp primer and most of Rcpp4Everyone by Masaki E. Tsuda to get to where I am now.

Here is the R code:

library(tidyverse)

geno <- data.frame(x1 = c(1,1,1),
                 x2 = c("rs001", "rs002", "rs003"),
                 x3 = c(224422,225108,225167),
                 x4 = c("T","A", "G"),
                 x5 = c("C", "C", "A"),
                 x6 = c(1,1,1),
                 x7 = c(0,0,0),
                 x8 = c(0,0,0),
                 x9 = c(1,0,1),
                 x10 = c(0,1,0),
                 x11 = c(0,0,0),
                 stringsAsFactors = F)

# What I'd like to turn into C++
geno_to_alleles <- function(geno) {
        # Pre-allocate final output - always initialize output variable to required length and data type
        tmp = matrix(nrow = (ncol(geno)-5)/3, ncol = nrow(geno), byrow= T)
        #j is subject index
        j =1
        for (i in seq(from=6,to=ncol(geno), by=3)){
                tmp[j,1:nrow(geno)] <- t(apply(geno[, i:(i+2)], 1, paste, collapse = ""))
                j = j + 1
        }
        return(tmp)
}

df_out <- geno_to_alleles(geno)

The result is a matrix that looks like this:

     [,1]  [,2]  [,3] 
[1,] "100" "100" "100"
[2,] "100" "010" "100"

So far I have the following C++ code, which reads in a DataFrame and creates a ComplexMatrix object whose dimensions depend on the size of the input DataFrame.

What I need help with is converting the following line to C++: tmp[j,1:nrow(geno)] <- t(apply(geno[, i:(i+2)], 1, paste, collapse = ""))

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {
        int input_rows = df.nrow(); // output
        int input_cols = df.ncol();
        Rcout << "Input DataFrame df has " << input_rows << " rows and "  << input_cols << " columns." << std::endl;

        int total_rows = (input_cols-5)/3;

        ComplexMatrix tmp(total_rows, input_rows);
        Rcout << "Output ComplexMatrix tmp has " << total_rows << " rows and "  << input_rows << " columns." << std::endl;


        // Below needs to be transpiled into C++
        //tmp[j,1:nrow(df)] <- t(apply(df[, i:(i+2)], 1, paste, collapse = ""))

        // return the new data frame
        return tmp;
}

1 answer:

Answer 0 (score: 2)

You can do this with a combination of std::to_string() and the + operator for string concatenation. We have the following C++ code:

#include <Rcpp.h>
#include <string>  // std::to_string

// [[Rcpp::export]]
Rcpp::CharacterMatrix geno_to_alleles_cpp(Rcpp::DataFrame x) {
    // Set up result object
    int n = x.nrow();
    int m = x.ncol();
    Rcpp::CharacterMatrix result( (m - 5) / 3, n );

    // We'll loop over columns in x, at the same time going over rows in result.
    // The j + 2 < m bound stops before an incomplete trailing triplet of columns.
    for ( int i = 0, j = 5; j + 2 < m; ++i, j += 3 ) {
        Rcpp::IntegerVector x1 = Rcpp::as<Rcpp::IntegerVector>(x[j]);
        Rcpp::IntegerVector x2 = Rcpp::as<Rcpp::IntegerVector>(x[j + 1]);
        Rcpp::IntegerVector x3 = Rcpp::as<Rcpp::IntegerVector>(x[j + 2]);
        // Then we go over the columns in result / rows in x
        for ( int k = 0; k < n; ++k ) {
            result(i, k) = std::to_string(x1[k]) + std::to_string(x2[k])
                           + std::to_string(x3[k]);
        }
    }

    return result;
}
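
One detail worth spelling out in the code above is the Rcpp::as<Rcpp::IntegerVector>() conversion. data.frame() stores c(1,1,1) as doubles, and std::to_string() formats a double with six decimal places, so pasting the raw doubles would not reproduce "100". The standalone sketch below (plain C++, no Rcpp; join_int and join_double are hypothetical helper names used only for illustration) shows the difference:

```cpp
#include <string>

// Hypothetical helper (not part of Rcpp): paste three genotype values
// that have already been coerced to int, as the answer does.
std::string join_int(int a, int b, int c) {
    return std::to_string(a) + std::to_string(b) + std::to_string(c);
}

// Same concatenation without the coercion: std::to_string(double)
// formats like printf("%f"), i.e. with six digits after the decimal point.
std::string join_double(double a, double b, double c) {
    return std::to_string(a) + std::to_string(b) + std::to_string(c);
}
```

Here join_int(1, 0, 0) yields "100", whereas join_double(1.0, 0.0, 0.0) yields "1.0000000.0000000.000000". The integer coercion would drop any fractional part, so this approach assumes the genotype columns hold whole numbers, as they do in the example data.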

Calling geno_to_alleles_cpp() from R gives exactly what we want:

geno <- data.frame(x1 = c(1,1,1),
                   x2 = c("rs001", "rs002", "rs003"),
                   x3 = c(224422,225108,225167),
                   x4 = c("T","A", "G"),
                   x5 = c("C", "C", "A"),
                   x6 = c(1,1,1),
                   x7 = c(0,0,0),
                   x8 = c(0,0,0),
                   x9 = c(1,0,1),
                   x10 = c(0,1,0),
                   x11 = c(0,0,0),
                   stringsAsFactors = F)

geno_to_alleles <- function(geno) {
    # Pre-allocate final output - always initialize output variable to required length and data type
    tmp = matrix(nrow = (ncol(geno)-5)/3, ncol = nrow(geno), byrow= T)
    #j is subject index
    j =1
    for (i in seq(from=6,to=ncol(geno), by=3)){
        tmp[j,1:nrow(geno)] <- t(apply(geno[, i:(i+2)], 1, paste, collapse = ""))
        j = j + 1
    }
    return(tmp)
}

Rcpp::sourceCpp("geno_to_alleles_cpp.cpp")
geno_to_alleles(geno)
#      [,1]  [,2]  [,3] 
# [1,] "100" "100" "100"
# [2,] "100" "010" "100"
geno_to_alleles_cpp(geno)
#      [,1]  [,2]  [,3] 
# [1,] "100" "100" "100"
# [2,] "100" "010" "100"

And, at least with this data, it is much faster than base R (I have not checked how it scales):

library(microbenchmark)
microbenchmark(base = geno_to_alleles(geno), rcpp = geno_to_alleles_cpp(geno))

Unit: microseconds
 expr      min        lq       mean    median        uq      max neval
 base 1296.948 1305.4190 1328.34660 1316.4780 1340.8675 1573.943   100
 rcpp   33.893   35.5445   77.57828   38.9405   41.0365 3851.134   100
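
If you want to check the indexing of geno_to_alleles_cpp() without an R session, the same loop can be sketched in plain C++. This is only an illustration under stated assumptions: Columns and alleles_plain are hypothetical names, the input holds just the genotype columns (x6 onward, already coerced to int), and a vector of vectors stands in for both the data frame and the CharacterMatrix.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the genotype part of the data frame:
// cols[c][r] is the value in genotype column c, input row r.
using Columns = std::vector<std::vector<int>>;

// Returns result[i][k]: the three values of column triplet i for input
// row k pasted together, mirroring result(i, k) in geno_to_alleles_cpp.
std::vector<std::vector<std::string>> alleles_plain(const Columns& cols) {
    // Integer division, like (m - 5) / 3; an incomplete trailing triplet is ignored.
    const std::size_t n_triplets = cols.size() / 3;
    const std::size_t n_rows = cols.empty() ? 0 : cols[0].size();

    std::vector<std::vector<std::string>> result(
        n_triplets, std::vector<std::string>(n_rows));

    for (std::size_t i = 0; i < n_triplets; ++i) {
        const std::size_t j = 3 * i;  // first column of triplet i
        for (std::size_t k = 0; k < n_rows; ++k) {
            result[i][k] = std::to_string(cols[j][k])
                         + std::to_string(cols[j + 1][k])
                         + std::to_string(cols[j + 2][k]);
        }
    }
    return result;
}
```

Passing the six genotype columns from the example (x6 through x11) gives two rows, {"100", "100", "100"} and {"100", "010", "100"}, matching the matrix printed by both R functions.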