取消列出,比较和分隔列表列的更快方法,同时保留大数据的行号。

时间:2019-07-12 19:04:37

标签: r data.table

在我的R data.table中,有一列find_tla是一种从文本字段中提取的大写三字母缩写的列表格式。我不仅要使它成为按行唯一的列表,还要将每个缩写与一个单独的列表(在tla_$tla上找到1,200个3个字母的缩写)进行比较,并且仅在该列表中找到它时才保留它。然后,我想将每个唯一的和已批准的事件放入各自的列中。

当前,我将它嵌套在for循环和if语句中 将这些分开的,唯一的且已批准的缩写放入单独的data.table-中,但这不是必需的。我有一个代码可以在较小的规模上做到这一点,但是对于拥有超过500,000行的大型data.table来说,处理时间太长了,我希望找到一种更快,更有效的方法。 / p>

library(data.table)

#sample data
  test_data <- structure(
    list(  number = c(
        "INC000008561475",
        "INC000008561472",
        "INC000008561471",
        "INC000008561469",
        "INC000008561468",
        "INC000008561467",
        "INC000008561466",
        "INC000008561465",
        "INC000008561464",
        "INC000008561462",
        "INC000008561459"
      ),
      find_tla = list( 
        c("DBA", "DBA", "ORA", "DBA", "APP"),
        character(0),
        "IDM",
        "DDM",
        character(0), 
        c("UAT", "UAT","DDM", "UAT", "UAT"),
        character(0),
        character(0),
        c("APP","ORC","CDT","WEB","WEB","DBA","ORC","WEB","URL","APP","ORC","WEB","URL"),
        "APP",
        c("DBA", "DBA", "ORA", "DBA", "GDP", "OPS", "APP")
      )
    ),
    row.names = c(NA,-14L),
    class = c("data.table",
              "data.frame"))

#sample approved abbreviations list
tla_ <- structure(
  list(
    tla = c(
      "CDT",
      "DBA",
      "IDM",
      "ORC"
    )
  ),
  row.names = c("TLA"),
  class = c("data.table",
            "data.frame")
)



#sample of where I am currently storing separated abbreviations
    keep_tla <- data.table(number = test_data$number)
    keep_tla[, `:=`(r1 = "")]
    keep_tla[, `:=`(r2 = "")]
    keep_tla[, `:=`(r3 = "")]

这是我当前正在使用的代码:

  col=2
for (j in 1:length(test_data$find_tla)) {
     l <- unique(strsplit(unlist(test_data[j, "find_tla"]), ","))

     {
    for (i in 1:length(l))
    {
      if (length(l) == 0) {

      } else if (l[i] %in% tla_$TLA) {

      } else{
        keep_tla[j, col] <- l[i]
        col <- col + 1
      }
    }
  }
  col = 2
}

现在,使用更大的数据集,结果是要花费多个小时的处理时间,并创建data.table keep_tla,并在它们起源的同一行中保留唯一且经过批准的缩写,但现在将其分成所需的列数。

#current sample output
>keep_tla
             number  r1  r2  r3
 1: INC000008561475 DBA
 2: INC000008561472
 3: INC000008561471 IDM
 4: INC000008561469
 5: INC000008561468
 6: INC000008561467
 7: INC000008561466
 8: INC000008561465
 9: INC000008561464 
10: INC000008561462 ORC DBA CDT
11: INC000008561459 DBA

我可以使用更有效的代码来保留此输出,甚至只是将行的唯一,认可的,唯一的列添加到test_data的末尾。

2 个答案:

答案 0 :(得分:1)

也许这种格式对您有用?

tla <- c("CDT", "DBA", "IDM", "ORC")

test_data[, (tla) := as.data.frame(t(sapply(find_tla, function(ftla) { tla %in% ftla })))]

all_false <- test_data[, lapply(.SD, Negate(any)), .SDcols = tla]
all_false <- names(which(unlist(all_false)))

if (length(all_false) > 0L) {
  test_data[, (all_false) := NULL]
}

test_data[]
             number                    find_tla   DBA   IDM   ORC
 1: INC000008561475         DBA,DBA,ORA,DBA,APP  TRUE FALSE FALSE
 2: INC000008561472                             FALSE FALSE FALSE
 3: INC000008561471                         IDM FALSE  TRUE FALSE
 4: INC000008561469                         DDM FALSE FALSE FALSE
 5: INC000008561468                             FALSE FALSE FALSE
 6: INC000008561467         UAT,UAT,DDM,UAT,UAT FALSE FALSE FALSE
 7: INC000008561466                             FALSE FALSE FALSE
 8: INC000008561465                             FALSE FALSE FALSE
 9: INC000008561464 APP,ORC,WEB,ORC,WEB,APP,... FALSE FALSE  TRUE
10: INC000008561462                         APP FALSE FALSE FALSE
11: INC000008561459 DBA,DBA,ORA,DBA,GDP,OPS,...  TRUE FALSE FALSE

在这里,tla保留了批准的缩写, 第二行为每一行添加一列, 如果相应TRUE中的列表,则该行的值为find_tla 包含它。 通过使用sapply首先将检查应用于所有行的find_tla列表来获得这些值, 但由于sapply按列绑定, 结果必须转置。

下面的行用于检查某些列是否全部为FALSE, 即给定的缩写在任何行中都不存在。 这些列已删除。

如果您确实需要每行中的实际字符, 您可以在此后添加以下内容:

remaining_tla <- setdiff(tla, all_false)
test_data[, (remaining_tla) := Map(ifelse, .SD, yes = names(.SD), no = NA_character_), .SDcols = remaining_tla]

test_data[]
             number                    find_tla  DBA  IDM  ORC
 1: INC000008561475         DBA,DBA,ORA,DBA,APP  DBA <NA> <NA>
 2: INC000008561472                             <NA> <NA> <NA>
 3: INC000008561471                         IDM <NA>  IDM <NA>
 4: INC000008561469                         DDM <NA> <NA> <NA>
 5: INC000008561468                             <NA> <NA> <NA>
 6: INC000008561467         UAT,UAT,DDM,UAT,UAT <NA> <NA> <NA>
 7: INC000008561466                             <NA> <NA> <NA>
 8: INC000008561465                             <NA> <NA> <NA>
 9: INC000008561464 APP,ORC,WEB,ORC,WEB,APP,... <NA> <NA>  ORC
10: INC000008561462                         APP <NA> <NA> <NA>
11: INC000008561459 DBA,DBA,ORA,DBA,GDP,OPS,...  DBA <NA> <NA>

答案 1 :(得分:0)

有多种方法可以回答这个问题。这是另外两种方法,以及不同方法的标记。

所有人都要求number是唯一键。

向后intersect()

test_data[, .(keep_tla = list(intersect(find_tla[[1]], tla_$tla))), by = number]
             number    keep_tla
 1: INC000008561475         DBA
 2: INC000008561472            
 3: INC000008561471         IDM
 4: INC000008561469            
 5: INC000008561468            
 6: INC000008561467            
 7: INC000008561466            
 8: INC000008561465            
 9: INC000008561464 ORC,DBA,CDT
10: INC000008561462            
11: INC000008561459         DBA

keep_tla列仍然是列表。

重塑并加入

OP的预期输出具有有效tla值的单独列。

这可以通过将其重塑为长格式,通过与tla联接来过滤允许的tla_值,然后再次将其重塑为宽格式来实现。

library(data.table)
library(magrittr)
tla_[test_data[, unlist(find_tla) %>% unique(), by = number], 
     on =.(tla = V1), nomatch = NULL] %>% 
  dcast(number ~ rowid(number, prefix = "r"), value.var = "tla")
            number  r1   r2   r3
1: INC000008561459 DBA <NA> <NA>
2: INC000008561464 ORC  DBA  CDT
3: INC000008561471 IDM <NA> <NA>
4: INC000008561475 DBA <NA> <NA>

不幸的是,过滤(联接)也删除了具有 no find_tla值的行。这些行需要通过第二个连接来补充:

tla_[test_data[, unlist(find_tla) %>% unique(), by = number], 
     on =.(tla = V1), nomatch = NULL] %>% 
  dcast(number ~ rowid(number, prefix = "r"), value.var = "tla") %>% 
  .[test_data[, .(number)], on = "number"]
             number   r1   r2   r3
 1: INC000008561475  DBA <NA> <NA>
 2: INC000008561472 <NA> <NA> <NA>
 3: INC000008561471  IDM <NA> <NA>
 4: INC000008561469 <NA> <NA> <NA>
 5: INC000008561468 <NA> <NA> <NA>
 6: INC000008561467 <NA> <NA> <NA>
 7: INC000008561466 <NA> <NA> <NA>
 8: INC000008561465 <NA> <NA> <NA>
 9: INC000008561464  ORC  DBA  CDT
10: INC000008561462 <NA> <NA> <NA>
11: INC000008561459  DBA <NA> <NA>

最后,如果我们想完全重现OP的预期结果,我们需要将所有<NA>替换为""

keep_tla <- tla_[test_data[, unlist(find_tla) %>% unique(), by = number], 
     on =.(tla = V1), nomatch = NULL] %>% 
  dcast(number ~ rowid(number, prefix = "r"), value.var = "tla") %>% 
  .[test_data[, .(number)], on = "number"]
for( x in names(keep_tla)) set(keep_tla, which(is.na(keep_tla[, ..x])), x, "")
keep_tla
             number  r1  r2  r3
 1: INC000008561475 DBA        
 2: INC000008561472            
 3: INC000008561471 IDM        
 4: INC000008561469            
 5: INC000008561468            
 6: INC000008561467            
 7: INC000008561466            
 8: INC000008561465            
 9: INC000008561464 ORC DBA CDT
10: INC000008561462            
11: INC000008561459 DBA

数据

在这里,我们使用test_data的修改版本,该版本与预期结果相符:

test_data <- structure(
  list(  number = c(
    "INC000008561475",
    "INC000008561472",
    "INC000008561471",
    "INC000008561469",
    "INC000008561468",
    "INC000008561467",
    "INC000008561466",
    "INC000008561465",
    "INC000008561464",
    "INC000008561462",
    "INC000008561459"
  ),
  find_tla = list( 
    c("DBA", "DBA", "ORA", "DBA", "APP"),
    character(0),
    "IDM",
    "DDM",
    character(0), 
    c("UAT", "UAT","DDM", "UAT", "UAT"),
    character(0),
    character(0),
    c("APP","ORC","WEB","ORC","WEB","APP","DBA","WEB","URL","APP","CDT","WEB","URL"),
    "APP",
    c("DBA", "DBA", "ORA", "DBA", "GDP", "OPS", "APP")
  )
  ),
  row.names = c(NA,-14L),
  class = c("data.table",
            "data.frame"))

基准

OP希望找到一种更快,更有效的方法来处理他的500 k行的生产数据集。所以这是一个比较的基准

  • intersect()方法的3种不同变体,
  • 重塑(dcast())和联接方法,以及
  • Alexis'答案的第一部分。

由于OP正在寻求更快的解决方案,因此尚未考虑OP的双嵌套for循环。

对于100、1000、10 k和100 k行重复执行基准测试。为每种情况创建测试数据。当两个解决方案修改测试数据时,每次基准测试运行都将重新复制。由于使用不同的方法会返回结构上不同的结果,因此自动检查结果的一致性已被关闭。

# benchmark

all_tla <- CJ(LETTERS, LETTERS, LETTERS)[, Reduce(paste0, .SD)][1:2400]
valid_tla <- all_tla[1:1200]
tla_ <- data.table(tla = valid_tla)
library(bench)
bm <- press(
  n_rows = 10^c(2:5),
  {
    set.seed(123L)
    td0 <- data.table(
      number = seq(n_rows),
      find_tla = replicate(n_rows, list(sample(all_tla, rchisq(1000, 2), replace = TRUE)))
    )
    mark(
      insect1 = {
        td <- copy(td0)
        td[, .(keep_tla = list(intersect(find_tla[[1]], valid_tla))), by = number]
      },
      insect1u = {
        td <- copy(td0)
        td[, .(keep_tla = list(intersect(unique(find_tla[[1]]), valid_tla))), by = number]
      },
      insect2 = {
        td <- copy(td0)
        td[, .(number, keep_tla = lapply(find_tla, function(x) intersect(x, valid_tla)))]
      },
      insect3 = {
        td <- copy(td0)
        td[, find_tla := lapply(find_tla, function(x) intersect(x, valid_tla))][]
      },
      dcast = {
        td <- copy(td0)
        tla_[td[, unlist(find_tla) %>% unique(), by = number], 
             on =.(tla = V1), nomatch = NULL] %>% 
          dcast(number ~ rowid(number, prefix = "r"), value.var = "tla") %>% 
          .[td[, .(number)], on = "number"]
      },
      Alexis = {
        td <- copy(td0)
        tla <- valid_tla
        td[, (tla) := as.data.frame(t(sapply(find_tla, function(ftla) { tla %in% ftla })))]
        all_false <- td[, lapply(.SD, Negate(any)), .SDcols = tla]
        all_false <- names(which(unlist(all_false)))
        if (length(all_false) > 0L) {
          td[, (all_false) := NULL]
        }
        td[]
      },
      min_time = 2,
      check = FALSE
    )
  }
)

时间可以通过

可视化
library(ggplot2)
autoplot(bm)

enter image description here

(请注意对数时间刻度)。

“昆虫”变体比其​​他两种方法快得多,并且可以相当快地处理500 k行的生产数据集。

另一方面,如果可用计算机内存有限,则可能要考虑内存消耗。在这里,“ dcast”方法分配的内存比所有其他方法少得多,而Alexis的方法要求最高。

print(bm, n = Inf)
# A tibble: 20 x 14
   expression n_rows      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory    time  gc    
   <bch:expr>  <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>    <lis> <list>
 1 insect1       100   2.21ms   2.83ms  340.        1.24MB    3.47    589     6      1.73s <data.ta~ <Rprofme~ <bch~ <tibb~
 2 insect2       100    1.9ms   2.07ms  444.        1.24MB    3.65    851     7      1.92s <data.ta~ <Rprofme~ <bch~ <tibb~
 3 insect3       100   2.05ms   2.25ms  406.        1.21MB    4.28    759     8      1.87s <data.ta~ <Rprofme~ <bch~ <tibb~
 4 dcast         100  22.53ms  26.66ms   37.5     583.34KB    2.75     68     5      1.82s <data.ta~ <Rprofme~ <bch~ <tibb~
 5 Alexis        100  36.21ms  38.12ms   25.5       5.43MB    6.03     38     9      1.49s <data.ta~ <Rprofme~ <bch~ <tibb~
 6 insect1      1000  16.31ms  17.56ms   55.8       11.8MB    4.39     89     7      1.59s <data.ta~ <Rprofme~ <bch~ <tibb~
 7 insect2      1000  15.07ms  16.62ms   58.0      11.79MB    3.83    106     7      1.83s <data.ta~ <Rprofme~ <bch~ <tibb~
 8 insect3      1000  18.56ms  19.56ms   49.4      11.79MB    4.59     86     8      1.74s <data.ta~ <Rprofme~ <bch~ <tibb~
 9 dcast        1000 107.46ms 109.64ms    9.06      1.11MB   10.2       8     9   883.43ms <data.ta~ <Rprofme~ <bch~ <tibb~
10 Alexis       1000  59.44ms  64.22ms   15.2      43.52MB    5.79     21     8      1.38s <data.ta~ <Rprofme~ <bch~ <tibb~
11 insect1     10000  159.8ms 175.13ms    5.66    114.96MB    4.71     12    10      2.12s <data.ta~ <Rprofme~ <bch~ <tibb~
12 insect2     10000 151.99ms 187.61ms    5.26    114.84MB    4.30     11     9      2.09s <data.ta~ <Rprofme~ <bch~ <tibb~
13 insect3     10000 219.12ms 272.98ms    3.45    115.12MB    3.94      7     8      2.03s <data.ta~ <Rprofme~ <bch~ <tibb~
14 dcast       10000 941.24ms    1.11s    0.898     7.66MB    5.84      2    13      2.23s <data.ta~ <Rprofme~ <bch~ <tibb~
15 Alexis      10000 624.61ms 633.59ms    1.42    424.82MB    6.13      3    13      2.12s <data.ta~ <Rprofme~ <bch~ <tibb~
16 insect1    100000    1.86s    1.91s    0.523     1.11GB    1.57      2     6      3.82s <data.ta~ <Rprofme~ <bch~ <tibb~
17 insect2    100000    2.56s    2.56s    0.391     1.11GB    0.783     1     2      2.56s <data.ta~ <Rprofme~ <bch~ <tibb~
18 insect3    100000    2.21s    2.21s    0.453     1.11GB    1.36      1     3      2.21s <data.ta~ <Rprofme~ <bch~ <tibb~
19 dcast      100000   10.34s   10.34s    0.0968   73.18MB    1.84      1    19     10.34s <data.ta~ <Rprofme~ <bch~ <tibb~
20 Alexis     100000   11.12s   11.12s    0.0900    4.13GB    0.540     1     6     11.12s <data.ta~ <Rprofme~ <bch~ <tibb~