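OK, I have a relatively complex one. A data.table solution would be most welcome, but anything goes, really. You can just copy and paste the reproducible example below to get the input and output data.tables.
I want to group by uniqueID, but within each group I want to compare all Description rows, find any overlapping words or phrases, and assign only that overlap to the single record that is kept. Hopefully the example is self-explanatory. One important point: I am indifferent to the order in which the words or phrases appear.
Example: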
> input_x
uniqueID Sourced_from Description
1: RandomHash1 DB1 This is an example of what I would like to keep
2: RandomHash1 DB1 That is another example of what I would like to keep -; random text added here
3: RandomHash2 DB2 All of these examples depend on the uniqueID and I need to only keep the overlapping part
4: RandomHash2 DB2 Overlapping part
5: RandomHash3 DB1 This should be on its own because its hash is non associated with another
> output_x
uniqueID Sourced_from Description
1: RandomHash1 DB1 is example of what I would like to keep
2: RandomHash2 DB2 Overlapping part
3: RandomHash3 DB1 This should be on its own because its hash is non associated with another
Reproducible example code:
library(data.table)
input_x <- setDT(structure(list(uniqueID = c("RandomHash1", "RandomHash1", "RandomHash2", "RandomHash2", "RandomHash3" ),
Sourced_from = c("DB1", "DB1", "DB2", "DB2", "DB1" ),
Description = c("This is an example of what I would like to keep",
"That is another example of what I would like to keep -; random text added here",
"All of these examples depend on the uniqueID and I need to only keep the overlapping part",
"Overlapping part",
"This should be on its own because its hash is non associated with another")
),
.Names = c("uniqueID", "Sourced_from", "Description"),
class = "data.frame",
row.names = c(NA, -5L)
))
output_x <- setDT(structure(list(uniqueID = c("RandomHash1", "RandomHash2", "RandomHash3" ),
Sourced_from = c("DB1", "DB2", "DB1" ),
Description = c("is example of what I would like to keep",
"Overlapping part",
"This should be on its own because its hash is non associated with another")
),
.Names = c("uniqueID", "Sourced_from", "Description"),
class = "data.frame",
row.names = c(NA, -3L)
))
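Answer 0 (score: 1)
We can create a function that splits the strings into words and intersects them to find the words they have in common, then apply it per group with data.table, i.e.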
library(data.table)
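# Return the words shared by every string in x (compared in lower case)
f1 <- function(x) {
  # Split on punctuation or whitespace, then intersect the word sets pairwise
  i1 <- Reduce(intersect, strsplit(tolower(x), split = '[[:punct:]]|\\s'))
  return(paste(i1, collapse = ' '))
}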
input_x[, .(Description = f1(Description)), by = .(uniqueID, Sourced_from)][]
which gives (in lower case, because of tolower),
uniqueID Sourced_from Description
1: RandomHash1 DB1 is example of what i would like to keep
2: RandomHash2 DB2 overlapping part
3: RandomHash3 DB1 this should be on its own because its hash is non associated with another
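The result above is lower-cased, whereas the desired output_x in the question keeps capital letters (e.g. "I"). A minimal sketch of one possible variant follows, assuming it is acceptable to take the capitalisation from the first row of each group. The helper name f2 and the small table dt are hypothetical and not part of the original answer; dt reuses three rows from the question.

```r
library(data.table)

# f2 is a hypothetical helper, not part of the original answer.
# It compares words case-insensitively but returns them as written
# in the first row of the group.
f2 <- function(x) {
  # Split on runs of punctuation or whitespace and drop empty tokens
  words <- lapply(strsplit(x, "[[:punct:][:space:]]+"),
                  function(w) w[nzchar(w)])
  # Words shared by every row of the group, compared in lower case
  common <- Reduce(intersect, lapply(words, tolower))
  # Keep the first row's words (original case) whose lower-case form is shared;
  # repeated words in the first row are kept as they appear
  first <- words[[1]]
  paste(first[tolower(first) %in% common], collapse = " ")
}

# Three rows taken from the question's input_x
dt <- data.table(
  uniqueID = c("RandomHash1", "RandomHash1", "RandomHash3"),
  Sourced_from = c("DB1", "DB1", "DB1"),
  Description = c(
    "This is an example of what I would like to keep",
    "That is another example of what I would like to keep -; random text added here",
    "This should be on its own because its hash is non associated with another"
  )
)

res <- dt[, .(Description = f2(Description)), by = .(uniqueID, Sourced_from)]
print(res)
```

For RandomHash1 this yields "is example of what I would like to keep". Because the case is taken from the first row, the RandomHash2 group in the question's input_x would come out as "overlapping part" rather than "Overlapping part"; if the capitalisation of a different row is wanted, the choice of words[[1]] would need to change.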