How can I rewrite this slow R code to make it more efficient?

Time: 2017-03-13 16:47:06

Tags: r algorithm machine-learning apply

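Hi everyone, I would like to optimize the R code below, which uses for loops and takes a long time to run. I even tried compiling the function to byte code with the compiler package in R, but performance got worse. Is there a way to write this code with the apply family of functions instead?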

library(dplyr)  # provides %>% used below

word_separation <- function(inp_data) {
  df <- NULL
  for (k in 1:nrow(inp_data)) {
    # split the k-th row into its comma-separated words
    vec <- unlist(strsplit(as.vector(inp_data[k, ]), split = ","))
    if (length(vec) == 1) {
      df <- rbind(df, data.frame(first_col = vec, second_col = vec))
    } else {
      temp_df <- NULL
      for (i in 2:length(vec)) {
        for (j in i:length(vec)) {
          # pair the first word with the run of words from position i to j
          temp_df <- rbind(temp_df, data.frame(first_col = vec[1], second_col = paste(vec[i:j], collapse = ",")))
        }
        df <- rbind(df, temp_df)
        df[df == ""] <- NA
        df <- df %>% unique() %>% na.omit()
      }
    }
  }
  return(df)
}

Here, my inp_data data frame contains a single column of data:

    column
Milk,Bread,Eggs,Jam
Apple,Milk,Beer

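When passed to the function, it returns a data frame with two columns: the first column contains the first word of each row, and the second column contains combinations of the other words in that row.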

 first_col     second_col
   Milk          Bread
   Milk     Bread,Eggs
   Milk Bread,Eggs,Jam
   Milk           Eggs
   Milk       Eggs,Jam
   Milk            Jam
   Apple           Milk
   Apple      Milk,Beer
   Apple           Beer

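1 answer:

Answer 0: (score: 3)

The OP has specified that the input data consist of a single column. Therefore, we need to split the column before creating the combinations. (The answer given by Sathish silently skips this step.)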
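To illustrate the splitting step on its own, here is a minimal base R sketch. The data frame `inp` and the names `words`, `first_words` and `rest_words` are hypothetical and used only for illustration; they are not part of the solution that follows.

```r
# Hypothetical one-column input mirroring the question's data
inp <- data.frame(column = c("Milk,Bread,Eggs,Jam", "Apple,Milk,Beer"),
                  stringsAsFactors = FALSE)

# strsplit() returns a list with one character vector per row
words <- strsplit(inp$column, ",", fixed = TRUE)

# first word of each row, and the remaining words to be combined
first_words <- vapply(words, `[`, character(1), 1)
rest_words <- lapply(words, `[`, -1)
```

Only after this split can the remaining words of each row be combined.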

The following data.table solution uses only one lapply():

Data

Edit: added a row with only one word

library(data.table)
inp_data <- fread("    column
Milk,Bread,Eggs,Jam
Apple,Milk,Beer
Butter", sep = "\n")

Code

# split strings, output in long format, add row number for later join
molten <- inp_data[, rn := .I][, strsplit(column, ","), by = rn]
# create combinations of all words (except the first one)
combined <- molten[, unlist(
  lapply(seq_len(.N - 1), function(.i) as.data.table(
    combn(V1[-1], .i, paste, collapse = ",", simplify = TRUE)))), by = rn]
# right join
combined[molten[, .(rn, first_col = first(V1)), by = rn], 
         .(rn, first_col, second_col = V1), on = "rn"]
#    rn first_col     second_col
# 1:  1      Milk          Bread
# 2:  1      Milk           Eggs
# 3:  1      Milk            Jam
# 4:  1      Milk     Bread,Eggs
# 5:  1      Milk      Bread,Jam
# 6:  1      Milk       Eggs,Jam
# 7:  1      Milk Bread,Eggs,Jam
# 8:  2     Apple           Milk
# 9:  2     Apple           Beer
#10:  2     Apple      Milk,Beer
#11:  3    Butter             NA

Edit: changed the join to ensure that rows containing only one word are included.
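Note that the result above contains every combination of the remaining words, including non-adjacent ones such as Bread,Jam, whereas the expected output in the question lists only contiguous runs. If contiguous runs are what is wanted, one possible base R sketch follows. The helper name `contiguous_runs` and the vector `lines` are hypothetical, and returning NA for single-word rows is an assumption that mirrors the output above rather than the OP's function.

```r
# Hypothetical helper: pair the first word of a line with every
# contiguous run of the words that follow it
contiguous_runs <- function(line) {
  vec <- strsplit(line, ",", fixed = TRUE)[[1]]
  first <- vec[1]
  rest <- vec[-1]
  n <- length(rest)
  if (n == 0) {
    # assumption: single-word rows get NA, as in the data.table output
    return(data.frame(first_col = first, second_col = NA_character_,
                      stringsAsFactors = FALSE))
  }
  # all (start, end) index pairs with start <= end
  idx <- expand.grid(end = seq_len(n), start = seq_len(n))
  idx <- idx[idx$start <= idx$end, ]
  runs <- mapply(function(s, e) paste(rest[s:e], collapse = ","),
                 idx$start, idx$end)
  data.frame(first_col = first, second_col = unique(runs),
             stringsAsFactors = FALSE)
}

lines <- c("Milk,Bread,Eggs,Jam", "Apple,Milk,Beer", "Butter")
# build one small data frame per row, then bind them once at the end
result <- do.call(rbind, lapply(lines, contiguous_runs))
```

Each row is processed independently and the pieces are bound together a single time, instead of growing a data frame with rbind() inside nested loops.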