Why is my Rcpp function slow?

Time: 2020-01-29 19:51:13

Tags: rcpp

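I wrote a simple function that sets a column reversal_indicator to "Yes" if the value in Reversal_Accounting_Transaction_ID occurs anywhere in the Accounting_Transaction_ID column (i.e. in some other row). Most entries in the Reversal_Accounting_Transaction_ID column will probably be blank and should therefore get "No".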

The data frame is created from a 6 GB CSV file (let's say about 6 million rows) and is being processed on Databricks.

I'm not really sure why it takes so long.

```r
Rcpp::cppFunction('
std::vector<std::string> reversals(DataFrame frame)
{
  std::vector<std::string> Accounting_Transaction_ID = as<std::vector<std::string> >(frame["BELNR"]);
  std::vector<std::string> Reversal_Accounting_Transaction_ID = as<std::vector<std::string> >(frame["STBLG"]);
  std::vector<std::string> ReversalIndicator(Reversal_Accounting_Transaction_ID.size());

  if (Reversal_Accounting_Transaction_ID.size() == 0) {
    return ReversalIndicator;
  }
  int dfSize = Reversal_Accounting_Transaction_ID.size();
  for (int i = 0; i < dfSize; ++i) {
    // Default to "No"; overwritten only if a match is found below.
    ReversalIndicator[i] = "No";
    if (Reversal_Accounting_Transaction_ID[i] != "") {
      // Linear scan over all rows (starting at row 0) for each non-blank reversal ID.
      for (int j = 0; j < dfSize; ++j) {
        if (Accounting_Transaction_ID[j] == Reversal_Accounting_Transaction_ID[i]) {
          ReversalIndicator[i] = "Yes";
          break;
        }
      }
    }
  }
  return ReversalIndicator;
}
')
```

```r
df$reversal <- reversals(df)
```

2 answers:

Answer 0 (score: 4)

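You are looping over every row of the data frame for each row, i.e. you perform on the order of 6M × 6M operations (O(N^2)). That can take a while. You can, however, go from O(N^2) to O(N) at the cost of some memory. Without any sample data I cannot test this, so I am only providing some pseudocode: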

    create empty set data structure

    for each row in df:
        add Accounting_Transaction_ID to set

    for each row in df:
        if Reversal_Accounting_Transaction_ID is non-empty and can be found in set:
            ReversalIndicator = "Yes"
        else:
            ReversalIndicator = "No"
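
The set-based idea can be sketched in plain C++ as follows. This is a minimal illustration that has not been run against the original data: the function name `flag_reversals` and its two-vector signature are assumptions made for the example, and it uses `std::unordered_set` (average constant-time lookup) so that the whole pass is O(N) on average. It follows the rule stated in the question: a row is flagged when its non-blank reversal ID appears among the accounting IDs.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not from the original post): returns "Yes" for row i
// when reversal_ids[i] is non-empty and occurs anywhere in accounting_ids,
// otherwise "No".
std::vector<std::string> flag_reversals(
    const std::vector<std::string>& accounting_ids,
    const std::vector<std::string>& reversal_ids) {
  // Build the lookup set once: a single pass over the accounting IDs.
  const std::unordered_set<std::string> known_ids(accounting_ids.begin(),
                                                  accounting_ids.end());

  // Default every row to "No" and overwrite only the matches.
  std::vector<std::string> indicator(reversal_ids.size(), "No");
  for (std::size_t i = 0; i < reversal_ids.size(); ++i) {
    const std::string& id = reversal_ids[i];
    if (!id.empty() && known_ids.count(id) > 0) {
      indicator[i] = "Yes";
    }
  }
  return indicator;
}
```

Inside `Rcpp::cppFunction`, the two vectors would come from `frame["BELNR"]` and `frame["STBLG"]`, as in the question.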

Answer 1 (score: 1)

Based on Ralf's answer:

I'm not sure whether I need to allocate the size up front.

```r
Rcpp::cppFunction('
std::vector<std::string> reversals(DataFrame frame)
{
  std::vector<std::string> Accounting_Transaction_ID = as<std::vector<std::string> >(frame["BELNR"]);
  std::vector<std::string> Reversal_Accounting_Transaction_ID = as<std::vector<std::string> >(frame["STBLG"]);
  // Every row defaults to "No"; only matches are overwritten below.
  std::vector<std::string> ReversalIndicator(Reversal_Accounting_Transaction_ID.size(), "No");

  // Collect every Accounting_Transaction_ID once so each lookup is O(log N).
  std::set<std::string> uniqueTransID(Accounting_Transaction_ID.begin(), Accounting_Transaction_ID.end());

  for (std::size_t i = 0; i < Reversal_Accounting_Transaction_ID.size(); ++i) {
    // Blank reversal IDs can never match, so they keep the default "No".
    if (Reversal_Accounting_Transaction_ID[i] == "") {
      continue;
    }
    if (uniqueTransID.find(Reversal_Accounting_Transaction_ID[i]) != uniqueTransID.end()) {
      ReversalIndicator[i] = "Yes";
    }
  }
  return ReversalIndicator;
}
', includes = "#include <set>")
```
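
On allocating the size up front: `std::set` is a node-based tree and has no `reserve()`, so there is nothing to pre-size there, and the result vector is already sized by its constructor. If a hash set is used instead, `std::unordered_set::reserve` can size the bucket array before the inserts. A minimal sketch of that variant, where the helper name `build_id_set` is hypothetical and whether it pays off on the real data is untested:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: builds the lookup set with buckets reserved up front.
std::unordered_set<std::string> build_id_set(
    const std::vector<std::string>& accounting_ids) {
  std::unordered_set<std::string> ids;
  // reserve() sizes the bucket array for this many elements so that inserting
  // them does not trigger a rehash; duplicate IDs simply leave spare room.
  ids.reserve(accounting_ids.size());
  for (const std::string& id : accounting_ids) {
    ids.insert(id);
  }
  return ids;
}
```

The returned set can then be queried with `count()` or `find()` exactly like the `std::set` in the answer.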