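I am trying to convert a for loop written in R into C++ for use with Rcpp; specifically, an 'apply'-type function combined with a transpose. The function takes a .gen file and converts it to alleles.
I have read Nick Ulle's Rcpp primer and most of Masaki E. Tsuda's Rcpp4Everyone to get to where I am.
Here is the R code: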
library(tidyverse)
geno <- data.frame(x1 = c(1,1,1),
                   x2 = c("rs001", "rs002", "rs003"),
                   x3 = c(224422,225108,225167),
                   x4 = c("T","A", "G"),
                   x5 = c("C", "C", "A"),
                   x6 = c(1,1,1),
                   x7 = c(0,0,0),
                   x8 = c(0,0,0),
                   x9 = c(1,0,1),
                   x10 = c(0,1,0),
                   x11 = c(0,0,0),
                   stringsAsFactors = FALSE)
# What I'd like to turn into C++
geno_to_alleles <- function(geno) {
  # Pre-allocate final output - always initialize output variable to required length and data type
  tmp = matrix(nrow = (ncol(geno)-5)/3, ncol = nrow(geno), byrow = TRUE)
  # j is subject index
  j = 1
  for (i in seq(from=6, to=ncol(geno), by=3)) {
    tmp[j,1:nrow(geno)] <- t(apply(geno[, i:(i+2)], 1, paste, collapse = ""))
    j = j + 1
  }
  return(tmp)
}
df_out <- geno_to_alleles(geno)
The resulting output is a matrix, as shown below:
     [,1]  [,2]  [,3]
[1,] "100" "100" "100"
[2,] "100" "010" "100"
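So far I have the following C++ code, which reads a DataFrame and creates a ComplexMatrix object whose dimensions depend on the size of the input DataFrame.
What I need help with is translating the following line into C++: tmp[j,1:nrow(geno)] <- t(apply(geno[, i:(i+2)], 1, paste, collapse = ""))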
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {
    int input_rows = df.nrow(); // output
    int input_cols = df.ncol();
    Rcout << "Input DataFrame df has " << input_rows << " rows and " << input_cols << " columns." << std::endl;
    int total_rows = (input_cols-5)/3;
    ComplexMatrix tmp(total_rows, input_rows);
    Rcout << "Output ComplexMatrix tmp has " << total_rows << " rows and " << input_rows << " columns." << std::endl;
    // Below needs to be transpiled into C++
    //tmp[j,1:nrow(df)] <- t(apply(df[, i:(i+2)], 1, paste, collapse = ""))
    // return the new data frame
    return tmp;
}
Answer 0 (score: 2)
You can do this with a combination of std::to_string() and the + operator for string concatenation. We have the following C++ code:
#include <Rcpp.h>
#include <string>  // std::to_string

// [[Rcpp::export]]
Rcpp::CharacterMatrix geno_to_alleles_cpp(Rcpp::DataFrame x) {
    // Set up result object: one row per group of three genotype columns,
    // one column per row of x
    int n = x.nrow();
    int m = x.ncol();
    Rcpp::CharacterMatrix result( (m - 5) / 3, n );
    // We'll loop over columns in x, at the same time going over rows in result.
    // j starts at 5, the zero-based index of the sixth column (x6 in the example).
    for ( int i = 0, j = 5; j < m; ++i, j += 3 ) {
        Rcpp::IntegerVector x1 = Rcpp::as<Rcpp::IntegerVector>(x[j]);
        Rcpp::IntegerVector x2 = Rcpp::as<Rcpp::IntegerVector>(x[j + 1]);
        Rcpp::IntegerVector x3 = Rcpp::as<Rcpp::IntegerVector>(x[j + 2]);
        // Then we go over the columns in result / rows in x
        for ( int k = 0; k < n; ++k ) {
            result(i, k) = std::to_string(x1[k]) + std::to_string(x2[k])
                         + std::to_string(x3[k]);
        }
    }
    return result;
}
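To look at the index mapping on its own, outside of Rcpp, here is a minimal standalone sketch using only the standard library. It assumes the columns are already available as plain integer vectors; the helper name join_triples, the cols[c][r] layout, and the skip parameter are illustrative choices, not part of Rcpp or of the code above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an Rcpp function): cols[c][r] holds the value in
// column c, row r. The first `skip` columns are treated as metadata and the
// remaining columns are consumed in groups of three.
std::vector<std::vector<std::string>> join_triples(
        const std::vector<std::vector<int>>& cols, std::size_t skip = 5) {
    std::vector<std::vector<std::string>> result;
    if (cols.size() < skip + 3) {
        return result;  // no complete group of three columns
    }
    const std::size_t n = cols[skip].size();  // number of input rows
    // Each group of three input columns becomes one output row.
    for (std::size_t j = skip; j + 2 < cols.size(); j += 3) {
        std::vector<std::string> row(n);
        // Each input row becomes one output column.
        for (std::size_t k = 0; k < n; ++k) {
            row[k] = std::to_string(cols[j][k])
                   + std::to_string(cols[j + 1][k])
                   + std::to_string(cols[j + 2][k]);
        }
        result.push_back(row);
    }
    return result;
}
```

The outer loop plays the role of the R for loop over column triples, and the inner loop replaces apply(..., 1, paste, collapse = ""); because each joined string is written directly to its output position, no explicit transpose is needed.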
This gives exactly what we want:
geno <- data.frame(x1 = c(1,1,1),
                   x2 = c("rs001", "rs002", "rs003"),
                   x3 = c(224422,225108,225167),
                   x4 = c("T","A", "G"),
                   x5 = c("C", "C", "A"),
                   x6 = c(1,1,1),
                   x7 = c(0,0,0),
                   x8 = c(0,0,0),
                   x9 = c(1,0,1),
                   x10 = c(0,1,0),
                   x11 = c(0,0,0),
                   stringsAsFactors = FALSE)
geno_to_alleles <- function(geno) {
  # Pre-allocate final output - always initialize output variable to required length and data type
  tmp = matrix(nrow = (ncol(geno)-5)/3, ncol = nrow(geno), byrow = TRUE)
  # j is subject index
  j = 1
  for (i in seq(from=6, to=ncol(geno), by=3)) {
    tmp[j,1:nrow(geno)] <- t(apply(geno[, i:(i+2)], 1, paste, collapse = ""))
    j = j + 1
  }
  return(tmp)
}
Rcpp::sourceCpp("geno_to_alleles_cpp.cpp")
geno_to_alleles(geno)
#      [,1]  [,2]  [,3]
# [1,] "100" "100" "100"
# [2,] "100" "010" "100"
geno_to_alleles_cpp(geno)
#      [,1]  [,2]  [,3]
# [1,] "100" "100" "100"
# [2,] "100" "010" "100"
And, at least on this data, it is much faster than base R, roughly 30 times faster by median (I have not checked how it scales):
library(microbenchmark)
microbenchmark(base = geno_to_alleles(geno), rcpp = geno_to_alleles_cpp(geno))
Unit: microseconds
 expr      min        lq       mean    median        uq      max neval
 base 1296.948 1305.4190 1328.34660 1316.4780 1340.8675 1573.943   100
 rcpp   33.893   35.5445   77.57828   38.9405   41.0365 3851.134   100