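I have a dataset with a column containing comma-separated values. I need to parse each value in this column, keeping only specific values and dropping the rest.
My code and data are: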
myDf <- structure(list(GeogPreferences = structure(1:4, .Label = c("Central and East Europe, Europe, North America, West Europe, US",
"Europe, North America, West Europe, US", "Global, North America",
"Northeast, Southeast, West, US"), class = "factor")), .Names = "GeogPreferences", class = "data.frame", row.names = c(NA,
-4L))
regionInterest <- c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")
k <- lapply(as.character(myDf$GeogPreferences), function(x) {
  z <- trimws(unlist(strsplit(x, split = ",")))
  z <- ifelse((z %in% regionInterest), z[z %in% regionInterest], z)
})
myDf$GeogPreferences<-unlist(k)
This is the error I get:
Error in `$<-.data.frame`(`*tmp*`, "GeogPreferences", value = c("Central and East Europe",
: replacement has 15 rows, data has 4
My dataset looks like this:
GeogPreferences
1 Central and East Europe, Europe, North America, West Europe, US
2 Europe, North America, West Europe, US
3 Global, North America
4 Northeast, Southeast, West, US
If a string in the column is in regionInterest, I want to keep it; otherwise I want to remove it.
The output I expect is:
GeogPreferences
1 North America, US
2 North America, US
3 North America
4 Northeast, Southeast, West, US
Can someone help me figure out what I am doing wrong? Thanks!
Answer 0 (score: 3)
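The error occurs because strsplit produces more elements than the input df has rows (15 values for 4 rows). Also, in the ifelse statement you return z on FALSE, so nothing is ever dropped and it does not do what you intended.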
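To see both problems concretely, here is a minimal sketch on a single row of the sample data (the names z and kept are only illustrative):

```r
regionInterest <- c("Americas", "North America", "US", "Northeast",
                    "Southeast", "West", "Midwest", "Southwest")

# One row of the sample data, split the same way as in the question
z <- trimws(unlist(strsplit("Global, North America", split = ",")))

# ifelse() always returns a vector as long as its test, so "Global"
# falls through to the FALSE branch and is kept
kept <- ifelse(z %in% regionInterest, z[z %in% regionInterest], z)
length(kept) == length(z)   # TRUE: nothing was dropped

# Logical subsetting is what actually removes the unwanted values
wanted <- z[z %in% regionInterest]
wanted
```

Because every row keeps all of its pieces, unlist(k) ends up with one element per piece rather than one per row, which is why the assignment back into the data frame fails.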
Here is a tidyr/dplyr solution to your problem.
library(dplyr)
library(tidyr)

myDf %>%
  mutate(id = row_number()) %>%                        # remember the original row
  separate_rows(GeogPreferences, sep = ",") %>%        # one region per row
  mutate(GeogPreferences = trimws(GeogPreferences)) %>%
  filter(GeogPreferences %in% c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")) %>%
  group_by(id) %>%
  summarize(GeogPreferences = toString(trimws(GeogPreferences))) %>%  # collapse back to one string per row
  select(-id)
# A tibble: 4 × 1
GeogPreferences
<chr>
1 North America, US
2 North America, US
3 North America
4 Northeast, Southeast, West, US
Answer 1 (score: 3)
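You could split the data first and then run the subsetting.
This is more efficient (because strsplit is vectorized), and the size of the vector in each split does not matter. Also, trimws is unnecessary here and only makes your code less efficient. Instead, split on ", " while specifying fixed = TRUE. This makes strsplit roughly 10 times faster, because it does not use a regular expression to split.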
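As a quick sanity check, the sketch below compares the two ways of splitting on two of the sample strings (viaRegex and viaFixed are illustrative names). It assumes every separator is exactly a comma followed by a single space; with irregular spacing the two approaches would differ.

```r
x <- c("Europe, North America, West Europe, US", "Global, North America")

# Regex split on "," followed by trimming the leftover spaces
viaRegex <- lapply(strsplit(x, split = ","), trimws)

# Literal split on ", ", with no regular expression involved
viaFixed <- strsplit(x, ", ", fixed = TRUE)

identical(viaRegex, viaFixed)
```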
The following works using only base R:
do.call(rbind, # you can use `rbind.data.frame` instead if you don't want a matrix
lapply(strsplit(as.character(myDf$GeogPreferences), ", ", fixed = TRUE),
function(x) toString(x[x %in% regionInterest])))
# [,1]
# [1,] "North America, US"
# [2,] "North America, US"
# [3,] "North America"
# [4,] "Northeast, Southeast, West, US"
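However, the above solution (like your own) is still a row-wise solution. Instead, we can try to achieve the same result by operating column-wise. By "column-wise" I mean that if we operate on the transposed split, the number of iterations equals the number of pieces in the longest entry of myDf$GeogPreferences (determined by the number of commas we split on), which should be significantly smaller than the number of rows in the data.
Here is an implementation using data.table::tstrsplit: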
tmp <- data.table::tstrsplit(myDf$GeogPreferences, ", ", fixed = TRUE)
res <- do.call(paste,
c(sep = ", ",
lapply(tmp, function(x) replace(x, !x %in% regionInterest, NA_character_))))
gsub("NA, |, NA", "", res)
# [1] "North America, US" "North America, US" "North America" "Northeast, Southeast, West, US"
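To make the transposed split easier to picture, here is a small sketch on two of the sample strings. It only inspects what tstrsplit returns and assumes the data.table package is installed.

```r
library(data.table)

x <- c("Global, North America", "Northeast, Southeast, West, US")
tmp <- tstrsplit(x, ", ", fixed = TRUE)

length(tmp)   # one list element per split position (4 here), not per row
tmp[[3]]      # third piece of every row; rows that are too short get NA
```

Each element of tmp is a vector with one entry per row, so the lapply in the solution above loops over split positions while replace and %in% work on whole columns at once.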
Here is a simple benchmark on a 100K-row dataset:
bigDF <- myDf[sample(nrow(myDf), 1e5, replace = TRUE),, drop = FALSE]
library(dplyr)
library(tidyr)
library(data.table)
tidyverse <- function(x) {
  x %>%
    mutate(id = row_number()) %>%
    separate_rows(GeogPreferences, sep = ",") %>%
    mutate(GeogPreferences = trimws(GeogPreferences)) %>%
    filter(GeogPreferences %in% c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")) %>%
    group_by(id) %>%
    summarize(GeogPreferences = toString(trimws(GeogPreferences))) %>%
    select(-id)
}
MF <- function(x) {
  k <- lapply(as.character(x$GeogPreferences), function(x) {
    z <- trimws(unlist(strsplit(x, split = ",")))
    z <- z[z %in% regionInterest]
  })
  sapply(k, paste, collapse = ", ")
}
DA1 <- function(x) {
  do.call(rbind,
          lapply(strsplit(as.character(x$GeogPreferences), ", ", fixed = TRUE),
                 function(x) toString(x[x %in% regionInterest])))
}
DA2 <- function(x) {
  tmp <- data.table::tstrsplit(x$GeogPreferences, ", ", fixed = TRUE)
  res <- do.call(paste,
                 c(sep = ", ",
                   lapply(tmp, function(x) replace(x, !x %in% regionInterest, NA_character_))))
  gsub("NA, |, NA", "", res)
}
system.time(tidyverse(bigDF))
# user system elapsed
# 17.67 0.01 17.91
system.time(MF(bigDF))
# user system elapsed
# 15.52 0.00 15.70
system.time(DA1(bigDF))
# user system elapsed
# 0.97 0.00 1.00
system.time(DA2(bigDF))
# user system elapsed
# 0.25 0.00 0.25
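So the other two solutions take more than 15 seconds to run, while both of my solutions run in one second or less.
Answer 2 (score: 2)
If you prefer a solution close to your own approach, change it to: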
regionInterest <- c("Americas", "North America", "US",
"Northeast","Southeast","West","Midwest","Southwest")
k <- lapply(as.character(myDf$GeogPreferences), function(x) {
  z <- trimws(unlist(strsplit(x, split = ",")))
  # this makes sure you only keep the values of z that are in regionInterest
  z <- z[z %in% regionInterest]
})
# paste with collapse creates one value out of a vector of strings, separated by the collapse argument
myDf$GeogPreferences <- sapply(k, paste, collapse = ", ")
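As a small illustration of the collapsing step, the sketch below shows what paste with collapse returns for a kept vector and for a row where nothing matched (the variable names are illustrative only):

```r
# Two kept regions become a single comma-separated string
joined <- paste(c("North America", "US"), collapse = ", ")
joined

# A row with no region of interest yields an empty string, not NA
empty <- paste(character(0), collapse = ", ")
empty
```

This is why sapply returns exactly one string per row, so the result has the same length as the data frame and the assignment succeeds.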
I hope this helps.