Question

我正在尝试编写一个函数来计算字符串向量中唯一项的数量（我的问题稍微复杂一些，但这是可重现的。我根据我在C ++中找到的答案做了这个。这是我的代码：

C ++

int unique_sort(vector<string> x) {
    sort(x.begin(), x.end());
    return unique(x.begin(), x.end()) - x.begin();
}

int unique_set(vector<string> x) {
    unordered_set<string> tab(x.begin(), x.end());
    return tab.size();
}

R：

x <- paste0("x", sample(1:1e5, 1e7, replace=T))
microbenchmark(length(unique(x)),unique_sort(x), unique_set(x), times=3)

结果：

Unit: milliseconds
              expr        min         lq       mean     median         uq
 length(unique(x))   365.0213   373.4018   406.0209   381.7823   426.5206
    unique_sort(x) 10732.1918 10847.0532 10907.6882 10961.9146 10995.4363
     unique_set(x)  1948.6517  2230.3383  2334.4040  2512.0249  2527.2802

查看unique函数的R源代码（它有点难以理解），它似乎在数组上使用循环向哈希添加唯一元素并检查哈希是否已经元素存在。

因此，我认为它应该等同于unordered_set方法。我不明白为什么unordered_set方法慢了5倍。

TLDR：为什么我的C ++代码很慢？

Answer 1

首先，请将示例重现。以上缺少Rcpp属性， C ++ 11 插件和必要的标头导入。

其次，此处显示的问题是执行从 R 到 C ++ 的深层复制数据的成本结构体。您的基准测试中的大部分时间都花在了复制过程中。选择使用std::vector<std::string>而不是Rcpp::CharacterVector来触发此过程，SEXP包含.sort()，s表达式或指向数据的指针。通过否定Rcpp对象提供的代理模型（仅执行浅拷贝），将数据导入 C ++ 的成本会立即降低，远远大于仅在微秒内描述的微秒Why is this simplistic cpp function version slower?。

话虽如此，让我们谈谈如何修改上面的例子来使用Rcpp对象。首先，请注意Rcpp对象有一个名为Rcpp::CharacterVector的成员函数，可以准确地对SEXP的缺失值进行排序（有关详细信息，请参阅Rcpp FAQ Section 5.5: Sorting with STL on a CharacterVector produces problematic results，假设没有大小写或特殊区域设置）。其次，即使将数据导入为std::unordered_set，Rcpp::CharacterVector类型也可用作构建#include <Rcpp.h> #include <unordered_set> #include <algorithm> // [[Rcpp::plugins(cpp11)]] // [[Rcpp::export]] int unique_sort(std::vector<std::string> x) { sort(x.begin(), x.end()); return unique(x.begin(), x.end()) - x.begin(); } // [[Rcpp::export]] int unique_set(std::vector<std::string> x) { std::unordered_set<std::string> tab(x.begin(), x.end()); return tab.size(); } // [[Rcpp::export]] int unique_sort_native(Rcpp::CharacterVector x) { x.sort(); return std::unique(x.begin(), x.end()) - x.begin(); } // [[Rcpp::export]] int unique_set_native(Rcpp::CharacterVector x) { std::unordered_set<SEXP> tab(x.begin(), x.end()); return tab.size(); }的方法。这些修改可以在声明中带有“native”的 C ++ 函数中找到。

# install.packages(c("microbenchmark"))

# Note, it is more efficient to supply an integer rather than a vector
# in sample()'s first parameter.
x <- paste0("x", sample(1e5, 1e7, replace=T))

# Run a microbenchmark
microbenchmark::microbenchmark(
  length(unique(x)),
  length(unique.default(x)),
  unique_sort(x),
  unique_set(x),
  unique_sort_native(x),
  unique_set_native(x),
  times = 10
)

测试代码：

Unit: milliseconds
                      expr     min      lq    mean  median      uq     max neval
         length(unique(x))   208.0   235.3   235.7   237.2   240.2   247.4    10
 length(unique.default(x))   230.9   232.8   238.8   233.7   241.8   266.6    10
            unique_sort(x) 12759.4 12877.1 12993.8 12920.1 13043.2 13416.7    10
             unique_set(x)  2528.1  2545.3  2590.1  2590.3  2631.3  2670.1    10
     unique_sort_native(x)  7452.6  7482.4  7568.5  7509.0  7563.6  7917.8    10
      unique_set_native(x)   175.8   176.9   179.2   178.3   182.3   183.4    10

输出：

unique_set_native

因此，当使用Rcpp对象避免深层复制时，length(unique())函数会将Class: class Client { get Name() { return this.mypush(); } mypush(){ return "Imran"; } }调用超过30毫秒。

为什么我的Rcpp实现用于查找比基本R慢的唯一项目数？

1 个答案: