编辑我有一个像这样的输入数据框:
我希望输出如下:
请在下面找到我的解释。我完全不知道给出更详细的解释:(
让我解释....在输入数据集中,对于COL1值为“10”的行,我想扫描COL2值并用“*”替换任何重复的文本模式...同样的逻辑去对于具有重复COL1值的所有COL2值。 我想为此使用gsub函数..
我尝试了gsub和粘贴几次并且没有得到所需的输出,因为我不知道如何匹配重复内部的所有模式。
我已经问过这个问题。但由于我没有得到答案,我正在重新发布。
在下面附加输入数据框的输入:
structure(list(COL1 = c(10L, 10L, 10L, 20L, 20L, 30L, 30L, 40L,
40L, 40L, 50L, 50L, 50L), COL2 = c("mary has life", "Don mary has life",
"Britto mary has life", "push them fur", "push them ", "yell at this",
"this is yell at this", "Year", "Doggy", "Horse", "This is great job",
"great job", "Donkey")), .Names = c("COL1", "COL2"), row.names = c(NA,
-13L), class = "data.frame")
答案 0 :(得分:4)
您可以编写一个函数来为组中的每个项目运行gsub
并选择最短的替换(当然除了它自己):
fun <- function(col){
matches <- sapply(col, function(x){gsub(x, '\\*', col)});
diag(matches) <- NA;
apply(matches, 1, function(x){x[which.min(nchar(x))]})
}
现在用你最喜欢的语法实现:
library(dplyr)
df %>% group_by(COL1) %>% mutate(COL3 = fun(COL2))
## Source: local data frame [13 x 3]
## Groups: COL1 [5]
##
## COL1 COL2 COL3
## <int> <chr> <chr>
## 1 10 mary has life mary has life
## 2 10 Don mary has life Don *
## 3 10 Britto mary has life Britto *
## 4 20 push them fur *fur
## 5 20 push them push them
## 6 30 yell at this yell at this
## 7 30 this is yell at this this is *
## 8 40 Year Year
## 9 40 Doggy Doggy
## 10 40 Horse Horse
## 11 50 This is great job This is *
## 12 50 great job great job
## 13 50 Donkey Donkey
或者将它全部保存在R:
中df$COL3 <- ave(df$COL2, df$COL1, FUN = fun)
df
## COL1 COL2 COL3
## 1 10 mary has life mary has life
## 2 10 Don mary has life Don *
## 3 10 Britto mary has life Britto *
## 4 20 push them fur *fur
## 5 20 push them push them
## 6 30 yell at this yell at this
## 7 30 this is yell at this this is *
## 8 40 Year Year
## 9 40 Doggy Doggy
## 10 40 Horse Horse
## 11 50 This is great job This is *
## 12 50 great job great job
## 13 50 Donkey Donkey