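I wrote a simple function that sets a column reversal_indicator to "Yes" if the value in Reversal_Accounting_Transaction_ID exists anywhere in the Accounting_Transaction_ID column (i.e., in some other row). Most entries in the Reversal_Accounting_Transaction_ID column are likely to be blank, and those rows should be "No".
The data frame was created from a 6 GB CSV file (roughly 6 million rows) and is processed on Databricks.
I am not sure why it takes so long.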
```
Rcpp::cppFunction('
std::vector<std::string> reversals(DataFrame frame)
{
    std::vector<std::string> Accounting_Transaction_ID =
        as<std::vector<std::string> >(frame["BELNR"]);
    std::vector<std::string> Reversal_Accounting_Transaction_ID =
        as<std::vector<std::string> >(frame["STBLG"]);
    std::vector<std::string> ReversalIndicator(Reversal_Accounting_Transaction_ID.size());
    if (Reversal_Accounting_Transaction_ID.size() == 0) {
        return ReversalIndicator;
    }
    int dfSize = Reversal_Accounting_Transaction_ID.size();
    for (int i = 0; i < dfSize; ++i) {
        // Default to "No"; blank reversal IDs never match.
        ReversalIndicator[i] = "No";
        if (Reversal_Accounting_Transaction_ID[i] != "") {
            // Linear scan over every row for each non-blank reversal ID.
            // Starts at 0 so the first row is checked as well.
            for (int j = 0; j < dfSize; ++j) {
                if (Accounting_Transaction_ID[j] == Reversal_Accounting_Transaction_ID[i]) {
                    ReversalIndicator[i] = "Yes";
                    break;
                }
            }
        }
    }
    return ReversalIndicator;
}
')
```

```
df$reversal <- reversals(df)
```
Answer 0 (score: 4)
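For every row you scan the whole data frame again, so you perform roughly 6M x 6M comparisons (O(N^2)). That can take a while. You can get from O(N^2) down to roughly O(N), at the cost of some memory, by building a lookup set first. Without sample data I cannot test this, so here is only pseudocode:

    create empty set data structure
    for each row in df:
        add Accounting_Transaction_ID to set
    for each row in df:
        if Reversal_Accounting_Transaction_ID is non-blank and can be found in set:
            ReversalIndicator = "Yes"
        else:
            ReversalIndicator = "No"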
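
The set-based idea can be sketched in plain C++ as follows. This is a minimal illustration, not code from the thread: the function name `flagReversals` and its two-vector signature are hypothetical, and it assumes both columns have already been extracted as string vectors of equal length. It uses `std::unordered_set`, whose lookups are O(1) on average, so the whole pass is roughly linear in the number of rows.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not from the original post): returns "Yes" for each row
// whose reversal ID is non-blank and appears anywhere in the ids column.
std::vector<std::string> flagReversals(const std::vector<std::string>& ids,
                                       const std::vector<std::string>& reversalIds) {
    // Both columns come from the same data frame, so lengths must match.
    assert(ids.size() == reversalIds.size());

    // Pass 1: collect every transaction ID once.
    std::unordered_set<std::string> known(ids.begin(), ids.end());

    // Pass 2: one hash lookup per row instead of a scan over all rows.
    std::vector<std::string> indicator;
    indicator.reserve(reversalIds.size());
    for (const std::string& rev : reversalIds) {
        const bool hit = !rev.empty() && known.count(rev) > 0;
        indicator.push_back(hit ? "Yes" : "No");
    }
    return indicator;
}
```

For example, with IDs `{"100", "101", "102"}` and reversal IDs `{"", "100", "999"}`, the result is `"No"`, `"Yes"`, `"No"`: the blank entry is skipped, `"100"` is found in the set, and `"999"` is not.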
Answer 1 (score: 1)
Based on Ralf's answer.
I am not sure whether I need to allocate the size of the result vector up front.
```
Rcpp::cppFunction('
std::vector<std::string> reversals(DataFrame frame)
{
    std::vector<std::string> Accounting_Transaction_ID =
        as<std::vector<std::string> >(frame["BELNR"]);
    std::vector<std::string> Reversal_Accounting_Transaction_ID =
        as<std::vector<std::string> >(frame["STBLG"]);
    // Sized up front because elements are assigned by index below.
    std::vector<std::string> ReversalIndicator(Reversal_Accounting_Transaction_ID.size());
    if (Reversal_Accounting_Transaction_ID.size() == 0) {
        return ReversalIndicator;
    }
    int dfSize = Reversal_Accounting_Transaction_ID.size();

    // Pass 1: collect all transaction IDs; std::set lookups are O(log N).
    std::set<std::string> uniqueTransID;
    for (int i = 0; i < dfSize; ++i) {
        uniqueTransID.insert(Accounting_Transaction_ID[i]);
    }

    // Pass 2: one set lookup per row instead of a full scan.
    for (int i = 0; i < dfSize; ++i) {
        // Blank reversal IDs are always "No".
        if (Reversal_Accounting_Transaction_ID[i] == "") {
            ReversalIndicator[i] = "No";
            continue;
        }
        if (uniqueTransID.find(Reversal_Accounting_Transaction_ID[i]) != uniqueTransID.end()) {
            ReversalIndicator[i] = "Yes";
        } else {
            ReversalIndicator[i] = "No";
        }
    }
    return ReversalIndicator;
}
', includes = "#include <set>")
```
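
On the question of sizing the result vector up front: assigning with `ReversalIndicator[i] = ...` requires that element `i` already exists, so the sized constructor is needed when writing by index. The alternative is to start empty and append. A minimal standalone sketch of both options, assuming nothing beyond the standard library (the names `fillByIndex` and `fillByAppend` are hypothetical, not from the thread):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Option A: size the vector up front, then assign by index.
std::vector<std::string> fillByIndex(std::size_t n) {
    std::vector<std::string> out(n);   // n empty strings already exist
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = "No";                 // valid because i < out.size()
    }
    return out;
}

// Option B: start empty, reserve capacity, then append.
std::vector<std::string> fillByAppend(std::size_t n) {
    std::vector<std::string> out;
    out.reserve(n);                    // capacity only; size() is still 0
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back("No");           // grows size() by one each time
    }
    assert(out.size() == n);           // every slot was appended exactly once
    return out;
}
```

Both functions return the same contents. Writing `out[i]` on a vector that was only reserved, or left empty, is undefined behavior, so the sized constructor (or `push_back`) is not optional.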