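Hi everyone, I want to optimize some R code that uses for loops, because it takes a long time to execute. I even tried using the compiler package in R to convert the function to byte code, but performance got worse. Is there a way to write this code with the apply family of functions?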
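library(magrittr)  # provides %>% (also re-exported by dplyr)

word_separation <- function(inp_data) {
  df <- NULL
  for (k in 1:nrow(inp_data)) {
    # split the k-th row into its comma-separated words
    vec <- unlist(strsplit(as.vector(inp_data[k, ]), split = ","))
    if (length(vec) == 1) {
      # single-word rows: the word is used for both columns
      df <- rbind(df, data.frame(first_col = vec, second_col = vec))
    } else {
      temp_df <- NULL
      for (i in 2:length(vec)) {
        for (j in i:length(vec)) {
          # pair the first word with the run of words from position i to j
          temp_df <- rbind(temp_df, data.frame(first_col = vec[1], second_col = paste(vec[i:j], collapse = ",")))
        }
        df <- rbind(df, temp_df)
        # drop empty strings and duplicate rows
        df[df == ""] <- NA
        df <- df %>% unique() %>% na.omit()
      }
    }
  }
  return(df)
}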
Here my inp_data data frame contains a single column of data:
column
Milk,Bread,Eggs,Jam
Apple,Milk,Beer
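When passed to the function, it returns a data frame with two columns: the first column holds the first word of each row, and the second column holds the combinations of the remaining words in that row.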
first_col second_col
Milk Bread
Milk Bread,Eggs
Milk Bread,Eggs,Jam
Milk Eggs
Milk Eggs,Jam
Milk Jam
Apple Milk
Apple Milk,Beer
Apple Beer
Answer 0 (score: 3)
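The OP has specified that the input data consists of a single column, so we need to split that column before creating the combinations. (The answer given by Sathish silently skips this step.)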
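As a rough illustration of that splitting step in plain base R (a sketch that is not part of the original answer; the names `words`, `first_words` and `rest_words` are made up for illustration, and the input simply mirrors the question's sample data):

```r
# Hypothetical input mirroring the question's single-column data frame
inp_data <- data.frame(
  column = c("Milk,Bread,Eggs,Jam", "Apple,Milk,Beer"),
  stringsAsFactors = FALSE
)

# strsplit() is vectorised: it returns one character vector per row
words <- strsplit(inp_data$column, ",", fixed = TRUE)

# first word of each row, and the remaining words of each row
first_words <- vapply(words, `[`, character(1), 1)
rest_words <- lapply(words, `[`, -1)
```

Only after this step is there a vector of words per row from which combinations can be built.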
The following data.table solution uses only a single lapply().
Edit: added a row with only one word.
library(data.table)
inp_data <- fread(" column
Milk,Bread,Eggs,Jam
Apple,Milk,Beer
Butter", sep = "\n")
# split strings, output in long format, add row number for later join
molten <- inp_data[, rn := .I][, strsplit(column, ","), by = rn]
# create combinations of all words (except the first one)
combined <- molten[, unlist(
  lapply(seq_len(.N - 1), function(.i) as.data.table(
    combn(V1[-1], .i, paste, collapse = ",", simplify = TRUE)))), by = rn]
# right join
combined[molten[, .(rn, first_col = first(V1)), by = rn],
         .(rn, first_col, second_col = V1), on = "rn"]
# rn first_col second_col
# 1: 1 Milk Bread
# 2: 1 Milk Eggs
# 3: 1 Milk Jam
# 4: 1 Milk Bread,Eggs
# 5: 1 Milk Bread,Jam
# 6: 1 Milk Eggs,Jam
# 7: 1 Milk Bread,Eggs,Jam
# 8: 2 Apple Milk
# 9: 2 Apple Beer
#10: 2 Apple Milk,Beer
#11: 3 Butter NA
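To see what the lapply() over combination sizes in the code above does for a single row, here is a small base R sketch (not part of the original answer; `words` and `combos` are illustrative names). It calls combn() once per combination size and pastes each combination together:

```r
# Words of the first row after dropping the first word ("Milk")
words <- c("Bread", "Eggs", "Jam")

# For each size m = 1, 2, 3, build all combinations of m words and
# collapse each combination into one comma-separated string
combos <- unlist(lapply(seq_along(words), function(m) {
  combn(words, m, paste, collapse = ",")
}))

print(combos)
```

This yields seven strings for the first row, which corresponds to rows 1 to 7 of the output shown above.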
Edit: changed the join to ensure that rows containing only one word are included.
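Note that the combn()-based output contains non-adjacent combinations such as Bread,Jam, whereas the expected output in the question lists only runs of adjacent words. If adjacent runs are what is wanted, one possible base R sketch is shown below. It is an assumption about the intent, not the answer's method; `contiguous_runs` is a hypothetical helper, and it assumes the strings contain no empty fields. Unlike the question's function, it does not remove duplicate rows across the whole result.

```r
# Hypothetical helper: for one comma-separated string, pair the first word
# with every run of adjacent words taken from the remaining words.
contiguous_runs <- function(x) {
  vec <- strsplit(x, ",", fixed = TRUE)[[1]]
  n <- length(vec)
  if (n <= 1L) {
    # mirror the question's handling of single-word rows
    return(data.frame(first_col = vec, second_col = vec,
                      stringsAsFactors = FALSE))
  }
  # all (start, end) index pairs with 2 <= start <= end <= n
  starts <- rep(2:n, times = (n - 1L):1L)
  ends <- unlist(lapply(2:n, function(i) i:n))
  runs <- mapply(function(i, j) paste(vec[i:j], collapse = ","),
                 starts, ends)
  data.frame(first_col = vec[1], second_col = unname(runs),
             stringsAsFactors = FALSE)
}

inp_data <- data.frame(
  column = c("Milk,Bread,Eggs,Jam", "Apple,Milk,Beer"),
  stringsAsFactors = FALSE
)

# one lapply() over the rows, then a single rbind() at the end
result <- do.call(rbind, lapply(inp_data$column, contiguous_runs))
print(result)
```

Building the per-row pieces first and binding them once at the end avoids growing a data frame with rbind() inside nested loops, which is likely a major cost in the question's version.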