Question

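Is it possible to write a C++ function that takes an R data frame as input, modifies it (in our case, takes a subset), and returns the new data frame (in this question, a sub-data frame)? My code below may make the question clearer: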

Code:

# Suppose I have the data frame below created in R:
myDF = data.frame(id = rep(c(1,2), each = 5), alph = letters[1:10], mess = rnorm(10))

# Suppose I want to write a C++ function that gets id as input and returns 
# a sub-dataframe corresponding to that id (**If it's possible to return 
# DataFrame in C++**)

# Auxiliary function --> helps get a sub vector:
arma::vec myVecSubset(arma::vec vecMain, arma::vec IDVec, int ID){
  arma::uvec AuxVec = find(IDVec == ID);
  arma::vec rslt = arma::vec(AuxVec.size());
  for (int i = 0; i < AuxVec.size(); i++){
    rslt[i] = vecMain[AuxVec[i]];
  }
  return rslt;
}

# Here is my C++ function:
Rcpp::DataFrame myVecSubset(Rcpp::DataFrame myDF, int ID){
  arma::vec id = Rcpp::as<arma::vec>(myDF["id"]);
  arma::vec alph = Rcpp::as<arma::vec>(myDF["alpha"]);
  arma::vec mess = Rcpp::as<arma::vec>(myDF["mess"]);

  // here I take a sub-vector:
  arma::vec id_sub = myVecSubset(id, id, ID);
  arma::vec alph_sub = myVecSubset(alph, id, ID);
  arma::vec mess_sub = myVecSubset(mess, id, ID);

  // here is the CHALLENGE: How to combine these vectors into a new data frame???
  ???
}

In summary, there are really two main questions: 1) Is there a better way to take the sub-data frame above in C++? (I wish I could simply write myDF[myDF$id == ID, ]!)

2) Is there any way to combine id_sub, alph_sub, and mess_sub into an R data frame and return it?

I would really appreciate your help.

Answer 1

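To add to Romain's answer, you can try calling the [ operator through Rcpp. This is easy to do once we understand how df[x, ] is evaluated (i.e., it is really a call to "[.data.frame"(df, x, R_MissingArg)):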

#include <Rcpp.h>
using namespace Rcpp;

Function subset("[.data.frame");

// [[Rcpp::export]]
DataFrame subset_test(DataFrame x, IntegerVector y) {
  return subset(x, y, R_MissingArg);
}

/*** R
df <- data.frame(x=1:3, y=letters[1:3])
subset_test(df, c(1L, 2L))
*/

which gives me

> df <- data.frame(x=1:3, y=letters[1:3])
> subset_test(df, c(1L, 2L))
  x y
1 1 a
2 2 b

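Callbacks into R are generally slow in Rcpp, but depending on how much of a bottleneck this step is, it may still be fast enough for you.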

Be careful, though: for integer vectors, this function uses 1-based subsetting rather than 0-based subsetting.
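To illustrate the 1-based caveat above, here is a minimal sketch in plain C++ (std::vector stands in for Rcpp's IntegerVector, and to_one_based is a hypothetical helper name, not part of Rcpp). It shifts positions computed in 0-based C++ terms to the 1-based positions that R's [ expects, which you would do before passing them to a callback such as subset above:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not an Rcpp function): shift 0-based C++ positions
// to the 1-based positions that R's `[` operator expects.
std::vector<int> to_one_based(const std::vector<int>& zero_based) {
    std::vector<int> one_based;
    one_based.reserve(zero_based.size());
    for (int pos : zero_based) {
        // In R, negative positions mean "drop these rows", so reject them here.
        assert(pos >= 0);
        one_based.push_back(pos + 1);
    }
    return one_based;
}
```

For example, the C++ positions {0, 1} become {1, 2}, which select the first two rows when handed to R.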

Answer 2

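You don't need Rcpp and RcppArmadillo for this; you can just use R's subset or dplyr::filter. That is likely to be more efficient than your code, which has to deep copy the data from the data frame into Armadillo vectors, create new Armadillo vectors, and then copy those back into R vectors so that you can build the data frame. That generates a lot of waste. Another source of waste is that you run find three times on the same input.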
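The point about repeating find can be sketched as follows. This is plain C++ with std::vector standing in for the R or Armadillo columns, and matching_rows and take_rows are hypothetical names chosen for illustration. The matching row positions are computed once and then reused for every column:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collect the 0-based positions where ids == target.
// This is the single "find" pass.
std::vector<std::size_t> matching_rows(const std::vector<double>& ids,
                                       double target) {
    std::vector<std::size_t> rows;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (ids[i] == target) rows.push_back(i);
    }
    return rows;
}

// Hypothetical helper: pick the given rows out of one column of any type.
template <typename T>
std::vector<T> take_rows(const std::vector<T>& column,
                         const std::vector<std::size_t>& rows) {
    std::vector<T> out;
    out.reserve(rows.size());
    for (std::size_t r : rows) {
        assert(r < column.size());  // every row position must be in range
        out.push_back(column[r]);
    }
    return out;
}
```

With id = {1, 1, 2, 2} and alph = {"a", "b", "c", "d"}, calling matching_rows(id, 2) once yields the positions {2, 3}, and take_rows(alph, rows) then returns {"c", "d"} without searching id again.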

Anyway, to answer your question, use DataFrame::create.

DataFrame::create( _["id"] = id_sub, _["alpha"] = alph_sub, _["mess"] = mess_sub ) ;

Also note that in your code, alpha will be a factor, so arma::vec alph = Rcpp::as<arma::vec>(myDF["alpha"]); is unlikely to do what you want.
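To see why a factor does not convert into a numeric vector of letters: R stores a factor as 1-based integer codes plus a table of level labels. The sketch below models that layout in plain C++. FactorColumn and factor_labels are hypothetical names, not Rcpp types, and missing values are deliberately not handled. The likely consequence, stated here as an assumption, is that a numeric conversion sees only the integer codes and never the labels:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an R factor: 1-based integer codes plus the
// table of level labels that the codes refer to. NA is not modelled.
struct FactorColumn {
    std::vector<int> codes;
    std::vector<std::string> levels;
};

// Decode the integer codes back into their labels.
std::vector<std::string> factor_labels(const FactorColumn& f) {
    std::vector<std::string> labels;
    labels.reserve(f.codes.size());
    for (int code : f.codes) {
        // Codes are 1-based, so valid values run from 1 to levels.size().
        assert(code >= 1 && code <= static_cast<int>(f.levels.size()));
        labels.push_back(f.levels[code - 1]);
    }
    return labels;
}
```

A factor with codes {2, 1, 2} and levels {"a", "b"} decodes to {"b", "a", "b"}; treating it as numbers would give 2, 1, 2 instead.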

Answer 3

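Here is a complete test file. It does not need your extractor function and simply reassembles the subsets. However, it requires a very recent Rcpp, as currently on GitHub, where Kevin happened to add some work on subset indexing that is exactly what we need here: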

#include <Rcpp.h>

/*** R
##  Suppose I have the data frame below created in R:
##  NB: stringsAsFactors set to FALSE
##  NB: setting seed as well
set.seed(42)
myDF <- data.frame(id = rep(c(1,2), each = 5), 
                   alph = letters[1:10], 
                   mess = rnorm(10), 
                   stringsAsFactors=FALSE)
*/

// [[Rcpp::export]]
Rcpp::DataFrame extract(Rcpp::DataFrame D, Rcpp::IntegerVector idx) {

  Rcpp::IntegerVector     id = D["id"];
  Rcpp::CharacterVector alph = D["alph"];
  Rcpp::NumericVector   mess = D["mess"];

  return Rcpp::DataFrame::create(Rcpp::Named("id")    = id[idx],
                                 Rcpp::Named("alpha") = alph[idx],
                                 Rcpp::Named("mess")  = mess[idx]);
}

/*** R
extract(myDF, c(2,4,6,8))
*/

With that file, we get the expected result:

R> library(Rcpp)
R> sourceCpp("/tmp/sepher.cpp")

R> ##  Suppose I have the data frame below created in R:
R> ##  NB: stringsAsFactors set to FALSE
R> ##  NB: setting seed as well
R> set.seed(42)

R> myDF <- data.frame(id = rep(c(1,2), each = 5), 
+                    alph = letters[1:10], 
+                    mess = rnorm(10), 
+               .... [TRUNCATED] 

R> extract(myDF, c(2,4,6,8))
  id alpha     mess
1  1     c 0.363128
2  1     e 0.404268
3  2     g 1.511522
4  2     i 2.018424
R>
R> packageDescription("Rcpp")$Version   ## unreleased version
[1] "0.11.1.1"
R>

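I needed something similar just a few weeks ago (though without character vectors), and used Armadillo with its elem() function, passing a vector of unsigned int as the index.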

Rcpp function to select (and return) a sub-data frame
