我对R来说比较新。
我的数据框test
看起来像这样(纯文本只有1个变量X1
,但最多可以有2000万行):
DP - 2017 Jan 01
TI - Case Report of Severe Antithrombin Deficiency During Extracorporeal Membrane
Oxygenation and Therapeutic Plasma Exchange for Double Lung Transplantation.
PG - 11-13
LID - 10.1213/XAA.0000000000000412 [doi]
AB - Acquired antithrombin (AT) deficiency is not uncommon in cardiothoracic surgery
because of heparin exposure and dilutional or consumptive losses. We report a
case of acquired AT deficiency and resultant multiple deep vein thrombosis in a
patient with pulmonary fibrosis on veno-venous extracorporeal membrane
AD - From the Departments of *Anesthesiology and daggerCardiothoracic Surgery,
University of Maryland, Baltimore, Maryland.
JT - Saudi journal of kidney diseases and transplantation : an official publication of
the Saudi Center for Organ Transplantation, Saudi Arabia
JID - 9436968
我想重新创建"标签"对于没有的行(在开头也有3个空格),使用前面的标签。但是,我只需要为TI
和JT
重新创建标签,因为这些标签将是我最终需要提取的唯一行。
基本上,我的结果数据框应如下所示:
DP - 2017 Jan 01
TI - Case Report of Severe Antithrombin Deficiency During Extracorporeal Membrane
TI - Oxygenation and Therapeutic Plasma Exchange for Double Lung Transplantation.
PG - 11-13
LID - 10.1213/XAA.0000000000000412 [doi]
AB - Acquired antithrombin (AT) deficiency is not uncommon in cardiothoracic surgery
because of heparin exposure and dilutional or consumptive losses. We report a
case of acquired AT deficiency and resultant multiple deep vein thrombosis in a
patient with pulmonary fibrosis on veno-venous extracorporeal membrane
AD - From the Departments of *Anesthesiology and daggerCardiothoracic Surgery,
University of Maryland, Baltimore, Maryland.
JT - Saudi journal of kidney diseases and transplantation : an official publication of
JT - the Saudi Center for Organ Transplantation, Saudi Arabia
JID - 9436968
在没有"标签的行前面有3个空格,"所以这就是我目前的代码:
for (n in 1:nrow(test))
{
if (substr(test$X1[n], 1, 3) == " " && (substr(test$X1[n-1], 1, 2) == "TI" || substr(test$X1[n-1], 1, 2) == "JT"))
{
if (n > 1)
{
subs <- substr(test$X1[[n-1]], 1, 6)
}
subs <- substr(test$X1[[n-1]], 1, 6)
test$X1[n] <- sub(" ", subs, test$X1[n])
}
}
我当前的解决方案有效,但需要永久运行跨越2000万行的文本。请注意,因为我需要在多个大文件上运行此脚本。
感谢。
答案 0 :(得分:1)
1)我重写了你的功能:
yourFunction <- function(test) {
for (n in 2:nrow(test)) {
if (substr(test$X1[n], 1, 3) == " " &&
(substr(test$X1[n - 1], 1, 2) == "TI" ||
substr(test$X1[n - 1], 1, 2) == "JT")) {
subs <- substr(test$X1[[n - 1]], 1, 6)
test$X1[n] <- sub(" ", subs, test$X1[n])
}
}
test
}
2)让我们创建小数据集,看看我们的两个函数是如何工作的:
# small test dataset:
require(data.table)
variants <-
c("TI - text", " text2", "AD - text3", "JT - text4")
n <- 10
set.seed(26)
dt <- data.table(X1 = sample(variants, size = n, replace = T))
dt
X1
1: TI - text
2: text2
3: JT - text4
4: JT - text4
5: text2
6: text2
7: JT - text4
8: AD - text3
9: text2
10: AD - text3
3)yourFunction的结果:
yourFunction(dt)
X1
1: TI - text
2: TI - text2
3: JT - text4
4: JT - text4
5: JT - text2
6: JT - text2
7: JT - text4
8: AD - text3
9: text2
10: AD - text3
4)我使用zoo
,data.table
和stringi
写了这个函数(可能你可以在没有最后两个包的情况下做得很好)
myFunction1 <- function(dt) {
require(zoo)
require(stringi)
require(data.table)
d <- copy(dt)
d[, v6 := substr(X1, 1, 6)]
# d[, v3 := substr(v6, 1, 3)]
# d[, emty := ifelse(v3 == " ", T, F)]
d[v6 == " ", v6 := NA]
d[, v6 := na.locf(v6, na.rm = F)]
d[is.na(v6), v6 := " "]
stri_sub(d$X1, 1, 6) <- d$v6
d[, "X1", with = F]
}
5)主持结果:
r1 <- yourFunction(dt)
r2 <- myFunction1(dt)
all.equal(r1, r2)
[1] "Column 'X1': 1 string mismatch"
r2
X1
1: TI - text
2: TI - text2
3: JT - text4
4: JT - text4
5: JT - text2
6: JT - text2
7: JT - text4
8: AD - text3
9: AD - text2
10: AD - text3
结果不一样,我重新创建了你不想要/不需要的标签。如果你需要删除它们,那么你可以找到一些方法,但这种方法要快得多。
6)基准:(当n非常小时,你的功能更快)
# when n = 10
require(rbenchmark)
benchmark(myFunction1(dt),
yourFunction(dt), replications = 100,
columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
# 1 myFunction1(dt) 100 0.21 2.333
# 2 yourFunction(dt) 100 0.09 1.000
# when 1k / with 10 replications
n <- 1 * 1000
set.seed(231)
test <- sample(variants, size = n, replace = T)
dt <- data.table(X1 = test)
benchmark(myFunction1(dt),
yourFunction(dt), replications = 10,
columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
# 1 myFunction1(dt) 10 0.03 1.000
# 2 yourFunction(dt) 10 0.52 17.333
# when 50k
n <- 50 * 1000
set.seed(231)
test <- sample(variants, size = n, replace = T)
dt <- data.table(X1 = test)
dt
benchmark(myFunction1(dt),
yourFunction(dt), replications = 1,
columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
# 1 myFunction1(dt) 1 0.01 1
# 2 yourFunction(dt) 1 7.09 709
# time for 20 mil rows:
n <- 20e6
set.seed(231)
test <- sample(variants, size = n, replace = T)
dt <- data.table(X1 = test)
dt
system.time(myFunction1(dt))
# user system elapsed
# 6.23 0.78 7.04