在我的R
data.table
中,有一列find_tla
是一种从文本字段中提取的大写三字母缩写的列表格式。我不仅要使它成为按行唯一的列表,还要将每个缩写与一个单独的列表(在tla_$tla
上找到1,200个3个字母的缩写)进行比较,并且仅在该列表中找到它时才保留它。然后,我想将每个唯一的和已批准的事件放入各自的列中。
当前,我将它嵌套在for循环和if语句中
将这些分开的,唯一的且已批准的缩写放入单独的data.table
-中,但这不是必需的。我有一个代码可以在较小的规模上做到这一点,但是对于拥有超过500,000行的大型data.table
来说,处理时间太长了,我希望找到一种更快,更有效的方法。 / p>
library(data.table)
#sample data
test_data <- structure(
list( number = c(
"INC000008561475",
"INC000008561472",
"INC000008561471",
"INC000008561469",
"INC000008561468",
"INC000008561467",
"INC000008561466",
"INC000008561465",
"INC000008561464",
"INC000008561462",
"INC000008561459"
),
find_tla = list(
c("DBA", "DBA", "ORA", "DBA", "APP"),
character(0),
"IDM",
"DDM",
character(0),
c("UAT", "UAT","DDM", "UAT", "UAT"),
character(0),
character(0),
c("APP","ORC","CDT","WEB","WEB","DBA","ORC","WEB","URL","APP","ORC","WEB","URL"),
"APP",
c("DBA", "DBA", "ORA", "DBA", "GDP", "OPS", "APP")
)
),
row.names = c(NA,-14L),
class = c("data.table",
"data.frame"))
#sample approved abbreviations list
tla_ <- structure(
list(
tla = c(
"CDT",
"DBA",
"IDM",
"ORC"
)
),
row.names = c("TLA"),
class = c("data.table",
"data.frame")
)
#sample of where I am currently storing separated abbreviations
keep_tla <- data.table(number = test_data$number)
keep_tla[, `:=`(r1 = "")]
keep_tla[, `:=`(r2 = "")]
keep_tla[, `:=`(r3 = "")]
这是我当前正在使用的代码:
col=2
for (j in 1:length(test_data$find_tla)) {
l <- unique(strsplit(unlist(test_data[j, "find_tla"]), ","))
{
for (i in 1:length(l))
{
if (length(l) == 0) {
} else if (l[i] %in% tla_$TLA) {
} else{
keep_tla[j, col] <- l[i]
col <- col + 1
}
}
}
col = 2
}
现在,使用更大的数据集,结果是要花费多个小时的处理时间,并创建data.table
keep_tla
,并在它们起源的同一行中保留唯一且经过批准的缩写,但现在将其分成所需的列数。
#current sample output
>keep_tla
number r1 r2 r3
1: INC000008561475 DBA
2: INC000008561472
3: INC000008561471 IDM
4: INC000008561469
5: INC000008561468
6: INC000008561467
7: INC000008561466
8: INC000008561465
9: INC000008561464
10: INC000008561462 ORC DBA CDT
11: INC000008561459 DBA
我可以使用更有效的代码来保留此输出,甚至只是将行的唯一,认可的,唯一的列添加到test_data的末尾。
答案 0 :(得分:1)
也许这种格式对您有用?
tla <- c("CDT", "DBA", "IDM", "ORC")
test_data[, (tla) := as.data.frame(t(sapply(find_tla, function(ftla) { tla %in% ftla })))]
all_false <- test_data[, lapply(.SD, Negate(any)), .SDcols = tla]
all_false <- names(which(unlist(all_false)))
if (length(all_false) > 0L) {
test_data[, (all_false) := NULL]
}
test_data[]
number find_tla DBA IDM ORC
1: INC000008561475 DBA,DBA,ORA,DBA,APP TRUE FALSE FALSE
2: INC000008561472 FALSE FALSE FALSE
3: INC000008561471 IDM FALSE TRUE FALSE
4: INC000008561469 DDM FALSE FALSE FALSE
5: INC000008561468 FALSE FALSE FALSE
6: INC000008561467 UAT,UAT,DDM,UAT,UAT FALSE FALSE FALSE
7: INC000008561466 FALSE FALSE FALSE
8: INC000008561465 FALSE FALSE FALSE
9: INC000008561464 APP,ORC,WEB,ORC,WEB,APP,... FALSE FALSE TRUE
10: INC000008561462 APP FALSE FALSE FALSE
11: INC000008561459 DBA,DBA,ORA,DBA,GDP,OPS,... TRUE FALSE FALSE
在这里,tla
保留了批准的缩写,
第二行为每一行添加一列,
如果相应TRUE
中的列表,则该行的值为find_tla
包含它。
通过使用sapply
首先将检查应用于所有行的find_tla
列表来获得这些值,
但由于sapply
按列绑定,
结果必须转置。
下面的行用于检查某些列是否全部为FALSE
,
即给定的缩写在任何行中都不存在。
这些列已删除。
如果您确实需要每行中的实际字符, 您可以在此后添加以下内容:
remaining_tla <- setdiff(tla, all_false)
test_data[, (remaining_tla) := Map(ifelse, .SD, yes = names(.SD), no = NA_character_), .SDcols = remaining_tla]
test_data[]
number find_tla DBA IDM ORC
1: INC000008561475 DBA,DBA,ORA,DBA,APP DBA <NA> <NA>
2: INC000008561472 <NA> <NA> <NA>
3: INC000008561471 IDM <NA> IDM <NA>
4: INC000008561469 DDM <NA> <NA> <NA>
5: INC000008561468 <NA> <NA> <NA>
6: INC000008561467 UAT,UAT,DDM,UAT,UAT <NA> <NA> <NA>
7: INC000008561466 <NA> <NA> <NA>
8: INC000008561465 <NA> <NA> <NA>
9: INC000008561464 APP,ORC,WEB,ORC,WEB,APP,... <NA> <NA> ORC
10: INC000008561462 APP <NA> <NA> <NA>
11: INC000008561459 DBA,DBA,ORA,DBA,GDP,OPS,... DBA <NA> <NA>
答案 1 :(得分:0)
有多种方法可以回答这个问题。这是另外两种方法,以及不同方法的标记。
所有人都要求number
是唯一键。
intersect()
test_data[, .(keep_tla = list(intersect(find_tla[[1]], tla_$tla))), by = number]
number keep_tla 1: INC000008561475 DBA 2: INC000008561472 3: INC000008561471 IDM 4: INC000008561469 5: INC000008561468 6: INC000008561467 7: INC000008561466 8: INC000008561465 9: INC000008561464 ORC,DBA,CDT 10: INC000008561462 11: INC000008561459 DBA
keep_tla
列仍然是列表。
OP的预期输出具有有效tla
值的单独列。
这可以通过将其重塑为长格式,通过与tla
联接来过滤允许的tla_
值,然后再次将其重塑为宽格式来实现。
library(data.table)
library(magrittr)
tla_[test_data[, unlist(find_tla) %>% unique(), by = number],
on =.(tla = V1), nomatch = NULL] %>%
dcast(number ~ rowid(number, prefix = "r"), value.var = "tla")
number r1 r2 r3 1: INC000008561459 DBA <NA> <NA> 2: INC000008561464 ORC DBA CDT 3: INC000008561471 IDM <NA> <NA> 4: INC000008561475 DBA <NA> <NA>
不幸的是,过滤(联接)也删除了具有 no find_tla
值的行。这些行需要通过第二个连接来补充:
tla_[test_data[, unlist(find_tla) %>% unique(), by = number],
on =.(tla = V1), nomatch = NULL] %>%
dcast(number ~ rowid(number, prefix = "r"), value.var = "tla") %>%
.[test_data[, .(number)], on = "number"]
number r1 r2 r3 1: INC000008561475 DBA <NA> <NA> 2: INC000008561472 <NA> <NA> <NA> 3: INC000008561471 IDM <NA> <NA> 4: INC000008561469 <NA> <NA> <NA> 5: INC000008561468 <NA> <NA> <NA> 6: INC000008561467 <NA> <NA> <NA> 7: INC000008561466 <NA> <NA> <NA> 8: INC000008561465 <NA> <NA> <NA> 9: INC000008561464 ORC DBA CDT 10: INC000008561462 <NA> <NA> <NA> 11: INC000008561459 DBA <NA> <NA>
最后,如果我们想完全重现OP的预期结果,我们需要将所有<NA>
替换为""
:
keep_tla <- tla_[test_data[, unlist(find_tla) %>% unique(), by = number],
on =.(tla = V1), nomatch = NULL] %>%
dcast(number ~ rowid(number, prefix = "r"), value.var = "tla") %>%
.[test_data[, .(number)], on = "number"]
for( x in names(keep_tla)) set(keep_tla, which(is.na(keep_tla[, ..x])), x, "")
keep_tla
number r1 r2 r3 1: INC000008561475 DBA 2: INC000008561472 3: INC000008561471 IDM 4: INC000008561469 5: INC000008561468 6: INC000008561467 7: INC000008561466 8: INC000008561465 9: INC000008561464 ORC DBA CDT 10: INC000008561462 11: INC000008561459 DBA
在这里,我们使用test_data
的修改版本,该版本与预期结果相符:
test_data <- structure(
list( number = c(
"INC000008561475",
"INC000008561472",
"INC000008561471",
"INC000008561469",
"INC000008561468",
"INC000008561467",
"INC000008561466",
"INC000008561465",
"INC000008561464",
"INC000008561462",
"INC000008561459"
),
find_tla = list(
c("DBA", "DBA", "ORA", "DBA", "APP"),
character(0),
"IDM",
"DDM",
character(0),
c("UAT", "UAT","DDM", "UAT", "UAT"),
character(0),
character(0),
c("APP","ORC","WEB","ORC","WEB","APP","DBA","WEB","URL","APP","CDT","WEB","URL"),
"APP",
c("DBA", "DBA", "ORA", "DBA", "GDP", "OPS", "APP")
)
),
row.names = c(NA,-14L),
class = c("data.table",
"data.frame"))
OP希望找到一种更快,更有效的方法来处理他的500 k行的生产数据集。所以这是一个比较的基准
intersect()
方法的3种不同变体,dcast()
)和联接方法,以及由于OP正在寻求更快的解决方案,因此尚未考虑OP的双嵌套for循环。
对于100、1000、10 k和100 k行重复执行基准测试。为每种情况创建测试数据。当两个解决方案修改测试数据时,每次基准测试运行都将重新复制。由于使用不同的方法会返回结构上不同的结果,因此自动检查结果的一致性已被关闭。
# benchmark
all_tla <- CJ(LETTERS, LETTERS, LETTERS)[, Reduce(paste0, .SD)][1:2400]
valid_tla <- all_tla[1:1200]
tla_ <- data.table(tla = valid_tla)
library(bench)
bm <- press(
n_rows = 10^c(2:5),
{
set.seed(123L)
td0 <- data.table(
number = seq(n_rows),
find_tla = replicate(n_rows, list(sample(all_tla, rchisq(1000, 2), replace = TRUE)))
)
mark(
insect1 = {
td <- copy(td0)
td[, .(keep_tla = list(intersect(find_tla[[1]], valid_tla))), by = number]
},
insect1u = {
td <- copy(td0)
td[, .(keep_tla = list(intersect(unique(find_tla[[1]]), valid_tla))), by = number]
},
insect2 = {
td <- copy(td0)
td[, .(number, keep_tla = lapply(find_tla, function(x) intersect(x, valid_tla)))]
},
insect3 = {
td <- copy(td0)
td[, find_tla := lapply(find_tla, function(x) intersect(x, valid_tla))][]
},
dcast = {
td <- copy(td0)
tla_[td[, unlist(find_tla) %>% unique(), by = number],
on =.(tla = V1), nomatch = NULL] %>%
dcast(number ~ rowid(number, prefix = "r"), value.var = "tla") %>%
.[td[, .(number)], on = "number"]
},
Alexis = {
td <- copy(td0)
tla <- valid_tla
td[, (tla) := as.data.frame(t(sapply(find_tla, function(ftla) { tla %in% ftla })))]
all_false <- td[, lapply(.SD, Negate(any)), .SDcols = tla]
all_false <- names(which(unlist(all_false)))
if (length(all_false) > 0L) {
td[, (all_false) := NULL]
}
td[]
},
min_time = 2,
check = FALSE
)
}
)
时间可以通过
可视化library(ggplot2)
autoplot(bm)
(请注意对数时间刻度)。
“昆虫”变体比其他两种方法快得多,并且可以相当快地处理500 k行的生产数据集。
另一方面,如果可用计算机内存有限,则可能要考虑内存消耗。在这里,“ dcast”方法分配的内存比所有其他方法少得多,而Alexis的方法要求最高。
print(bm, n = Inf)
# A tibble: 20 x 14 expression n_rows min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis> <list> 1 insect1 100 2.21ms 2.83ms 340. 1.24MB 3.47 589 6 1.73s <data.ta~ <Rprofme~ <bch~ <tibb~ 2 insect2 100 1.9ms 2.07ms 444. 1.24MB 3.65 851 7 1.92s <data.ta~ <Rprofme~ <bch~ <tibb~ 3 insect3 100 2.05ms 2.25ms 406. 1.21MB 4.28 759 8 1.87s <data.ta~ <Rprofme~ <bch~ <tibb~ 4 dcast 100 22.53ms 26.66ms 37.5 583.34KB 2.75 68 5 1.82s <data.ta~ <Rprofme~ <bch~ <tibb~ 5 Alexis 100 36.21ms 38.12ms 25.5 5.43MB 6.03 38 9 1.49s <data.ta~ <Rprofme~ <bch~ <tibb~ 6 insect1 1000 16.31ms 17.56ms 55.8 11.8MB 4.39 89 7 1.59s <data.ta~ <Rprofme~ <bch~ <tibb~ 7 insect2 1000 15.07ms 16.62ms 58.0 11.79MB 3.83 106 7 1.83s <data.ta~ <Rprofme~ <bch~ <tibb~ 8 insect3 1000 18.56ms 19.56ms 49.4 11.79MB 4.59 86 8 1.74s <data.ta~ <Rprofme~ <bch~ <tibb~ 9 dcast 1000 107.46ms 109.64ms 9.06 1.11MB 10.2 8 9 883.43ms <data.ta~ <Rprofme~ <bch~ <tibb~ 10 Alexis 1000 59.44ms 64.22ms 15.2 43.52MB 5.79 21 8 1.38s <data.ta~ <Rprofme~ <bch~ <tibb~ 11 insect1 10000 159.8ms 175.13ms 5.66 114.96MB 4.71 12 10 2.12s <data.ta~ <Rprofme~ <bch~ <tibb~ 12 insect2 10000 151.99ms 187.61ms 5.26 114.84MB 4.30 11 9 2.09s <data.ta~ <Rprofme~ <bch~ <tibb~ 13 insect3 10000 219.12ms 272.98ms 3.45 115.12MB 3.94 7 8 2.03s <data.ta~ <Rprofme~ <bch~ <tibb~ 14 dcast 10000 941.24ms 1.11s 0.898 7.66MB 5.84 2 13 2.23s <data.ta~ <Rprofme~ <bch~ <tibb~ 15 Alexis 10000 624.61ms 633.59ms 1.42 424.82MB 6.13 3 13 2.12s <data.ta~ <Rprofme~ <bch~ <tibb~ 16 insect1 100000 1.86s 1.91s 0.523 1.11GB 1.57 2 6 3.82s <data.ta~ <Rprofme~ <bch~ <tibb~ 17 insect2 100000 2.56s 2.56s 0.391 1.11GB 0.783 1 2 2.56s <data.ta~ <Rprofme~ <bch~ <tibb~ 18 insect3 100000 2.21s 2.21s 0.453 1.11GB 1.36 1 3 2.21s <data.ta~ <Rprofme~ <bch~ <tibb~ 19 dcast 100000 10.34s 10.34s 0.0968 73.18MB 1.84 1 19 10.34s <data.ta~ <Rprofme~ <bch~ <tibb~ 20 Alexis 100000 11.12s 11.12s 0.0900 4.13GB 0.540 1 6 11.12s <data.ta~ <Rprofme~ <bch~ <tibb~