data.table regex: overlapping text by group

Asked: 2018-02-08 14:42:14

Tags: r regex text data.table

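OK, I have a relatively complex one. A data.table solution would be most welcome, but really anything goes. You can copy and paste the reproducible example below to get the input and output data.tables.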

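I want to group by uniqueID and, within each group, compare all Description rows, keep only the words or phrases that overlap across them, and assign that overlap to the single record retained for the group. Hopefully the example is self-explanatory. One important point: I am indifferent to the order in which the words or phrases appear.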

Example:

> input_x
      uniqueID Sourced_from                                                                               Description
1: RandomHash1          DB1                                           This is an example of what I would like to keep
2: RandomHash1          DB1            That is another example of what I would like to keep -; random text added here
3: RandomHash2          DB2 All of these examples depend on the uniqueID and I need to only keep the overlapping part
4: RandomHash2          DB2                                                                          Overlapping part
5: RandomHash3          DB1                 This should be on its own because its hash is non associated with another
> output_x
      uniqueID Sourced_from                                                               Description
1: RandomHash1          DB1                                   is example of what I would like to keep
2: RandomHash2          DB2                                                          Overlapping part
3: RandomHash3          DB1 This should be on its own because its hash is non associated with another

Reproducible example code:

library(data.table)

input_x <- setDT(structure(
  list(
    uniqueID     = c("RandomHash1", "RandomHash1", "RandomHash2", "RandomHash2", "RandomHash3"),
    Sourced_from = c("DB1", "DB1", "DB2", "DB2", "DB1"),
    Description  = c("This is an example of what I would like to keep",
                     "That is another example of what I would like to keep -; random text added here",
                     "All of these examples depend on the uniqueID and I need to only keep the overlapping part",
                     "Overlapping part",
                     "This should be on its own because its hash is non associated with another")
  ),
  .Names    = c("uniqueID", "Sourced_from", "Description"),
  class     = "data.frame",
  row.names = c(NA, -5L)
))

output_x <- setDT(structure(
  list(
    uniqueID     = c("RandomHash1", "RandomHash2", "RandomHash3"),
    Sourced_from = c("DB1", "DB2", "DB1"),
    Description  = c("is example of what I would like to keep",
                     "Overlapping part",
                     "This should be on its own because its hash is non associated with another")
  ),
  .Names    = c("uniqueID", "Sourced_from", "Description"),
  class     = "data.frame",
  row.names = c(NA, -3L)
))

1 Answer:

Answer 0 (score: 1)

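We can write a function that splits each string into words and intersects the results to find the words they have in common, then apply it per group with data.table, i.e.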

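library(data.table)

# Return the words shared by every string in x, in the order of the first string.
f1 <- function(x) {
    # Lower-case, then split on runs of punctuation or whitespace so that
    # adjacent separators (e.g. " -; ") do not produce empty tokens.
    i1 <- Reduce(intersect, strsplit(tolower(x), split = '[[:punct:][:space:]]+'))
    paste(i1, collapse = ' ')
}

# One output row per (uniqueID, Sourced_from) group.
input_x[, .(Description = f1(Description)), by = .(uniqueID, Sourced_from)][]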

which gives:

      uniqueID Sourced_from                                                               Description
1: RandomHash1          DB1                                   is example of what i would like to keep
2: RandomHash2          DB2                                                          overlapping part
3: RandomHash3          DB1 this should be on its own because its hash is non associated with another
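
To make the split-and-intersect step easier to follow, here is a small base-R walk-through on the two RandomHash1 descriptions. It is only a sketch: the separator pattern (runs of punctuation or whitespace) is an assumption of this example, and `f1_keep_case` is a hypothetical helper, not part of the answer above. It illustrates one possible way to keep the original capitalisation, since the result above is lower-cased while output_x is not.

```r
# Illustrative only: base R, no data.table needed.
# The two RandomHash1 descriptions from input_x.
x <- c("This is an example of what I would like to keep",
       "That is another example of what I would like to keep -; random text added here")

# Step 1: lower-case and split on runs of punctuation or whitespace.
tokens <- strsplit(tolower(x), split = "[[:punct:][:space:]]+")

# Step 2: intersect the token vectors pairwise, left to right.
# intersect() keeps the order of its first argument and drops duplicates.
common <- Reduce(intersect, tokens)

# Step 3: glue the surviving words back together.
result <- paste(common, collapse = " ")
print(result)

# Hypothetical variant (an assumption, not the answer's method):
# compare words case-insensitively, but return them as spelled in the
# first row of the group.
f1_keep_case <- function(x) {
  tokens <- strsplit(x, split = "[[:punct:][:space:]]+")
  common <- Reduce(intersect, lapply(tokens, tolower))
  first  <- tokens[[1]]
  keep   <- first[tolower(first) %in% common]
  paste(keep, collapse = " ")
}

print(f1_keep_case(x))
```

Because the variant takes its spelling from the first row of each group, RandomHash2 would still come out as "overlapping part" rather than the "Overlapping part" shown in output_x. If the exact capitalisation matters, the rule for choosing which row's spelling to keep would need to be decided separately. The variant can be dropped into the grouped call in place of `f1`.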