My data looks like this:
txt$txt:
my friend stays in adarsh nagar
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
I have an exhaustive list of city names. A few of them are listed below:
city:
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
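I am searching txt$txt for city names (taken from my "city" list) and, if any are present, extracting them into another column. The simple loop below works for me, but it takes a lot of time on a larger dataset.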
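for (i in 1:nrow(txt)) {
  a <- c()
  # test every city against row i, using word boundaries
  for (j in 1:nrow(city)) {
    a[j] <- grepl(paste("\\b", city[j, 1], "\\b", sep = ""), txt$txt[i])
  }
  # join the matched cities with "_", or record "NONE" if there were none
  txt$city[i] <- ifelse(sum(a) > 0, paste(city[which(a), 1], collapse = "_"), "NONE")
}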
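I tried using the apply function, and this is as far as I could get. Because it splits each text on spaces, it misses multi-word cities such as "adarsh nagar":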
apply(as.matrix(txt$txt), 1, function(x){ifelse(sum(unlist(strsplit(x, " ")) %in% city[,1]) > 0, paste(unlist(strsplit(x, " "))[which(unlist(strsplit(x, " ")) %in% city[,1])], collapse = "_"), "NONE")})
[1] "NONE" "NONE" "bangalore" "bkc"
Desired Output:
> txt
  txt                                                       city
1 my friend stays in adarsh nagar                           adarsh nagar
2 I changed one apple one samsung S3 n one sony experia z.  NONE
3 Hi girls..Friends meet at bangalore                       bangalore
4 what do u think of ccd at bkc                             bkc
I would like a faster approach in R that does the same thing as the for loop above. Please advise. Thanks.
Answer 0 (score: 3)
You can use stri_extract_first_regex from the stringi package:
library(stringi)
# prepare some data
df <- data.frame(txt = c("in adarsh nagar", "sony experia z", "at bangalore"))
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")
df$city <- stri_extract_first_regex(str = df$txt, regex = paste(city, collapse = "|"))
df
#               txt         city
# 1 in adarsh nagar adarsh nagar
# 2  sony experia z         <NA>
# 3    at bangalore    bangalore
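Two things to watch for with a plain alternation like the one above: it also matches inside longer words, and a row without a match comes back as NA rather than "NONE". The sketch below is a base-R analogue, not the stringi call itself, that wraps the alternation in \\b word boundaries and fills in "NONE". The helper name extract_first_city and the fourth sample string are made up for illustration, and the city names are assumed to contain no regex metacharacters.

```r
# Hypothetical helper (not part of stringi): first whole-word city match per string.
extract_first_city <- function(x, cities, none = "NONE") {
  # Word boundaries stop "bangalore" from matching inside "bangalorean".
  pattern <- paste0("\\b(", paste(cities, collapse = "|"), ")\\b")
  pos <- regexpr(pattern, x)          # -1 where nothing matched
  found <- as.vector(pos) > 0
  out <- rep(none, length(x))
  out[found] <- regmatches(x, pos)    # regmatches() returns only the matched elements
  out
}

city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")
txt <- c("in adarsh nagar", "sony experia z", "at bangalore", "the bangalorean way")
extract_first_city(txt, city)
# [1] "adarsh nagar" "NONE"         "bangalore"    "NONE"
```

Like stri_extract_first_regex, this keeps only the first city found in each string.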
Answer 1 (score: 1)
This should be much faster:
bigPattern <- paste('(\\b',city[,1],'\\b)',collapse='|',sep='')
txt$city <- sapply(regmatches(txt$txt,gregexpr(bigPattern,txt$txt)),FUN=function(x) ifelse(length(x) == 0,'NONE',paste(unique(x),collapse='_')))
Explanation
In the first line, we build one big regular expression that matches all of the cities, e.g.:
(\\bahmedabad\\b)|(\\badarsh nagar\\b)|(\\bairoli\\b)| ...
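Then we use gregexpr together with regmatches, which gives us a list of matches for each element of txt$txt.
Finally, with a simple sapply, for each element of the list we concatenate the matched cities (after removing duplicates, i.e. cities mentioned more than once).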
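To make the three steps concrete, here is a self-contained run on the sample data from the question, plus one made-up fifth row that mentions a city twice. It is only a sketch: it assumes the city names contain no regex metacharacters, and it swaps in vapply for sapply purely so that the result is guaranteed to be a character vector.

```r
# Sample data from the question; the fifth text is invented to show de-duplication.
txt <- data.frame(
  txt = c("my friend stays in adarsh nagar",
          "I changed one apple one samsung S3 n one sony experia z.",
          "Hi girls..Friends meet at bangalore",
          "what do u think of ccd at bkc",
          "bangalore to bkc and back to bangalore"),
  stringsAsFactors = FALSE
)
city <- data.frame(
  city = c("ahmedabad", "adarsh nagar", "airoli", "bangalore",
           "bangaladesh", "banerghatta Road", "bkc", "calcutta"),
  stringsAsFactors = FALSE
)

# Step 1: one alternation with word boundaries around every city.
big_pattern <- paste0("(\\b", city$city, "\\b)", collapse = "|")

# Step 2: all matches per text, as a list of character vectors.
hits <- regmatches(txt$txt, gregexpr(big_pattern, txt$txt))

# Step 3: de-duplicate and join, or fall back to "NONE".
txt$city <- vapply(
  hits,
  function(x) if (length(x) == 0) "NONE" else paste(unique(x), collapse = "_"),
  character(1)
)
txt$city
# [1] "adarsh nagar"  "NONE"          "bangalore"     "bkc"           "bangalore_bkc"
```

The text is scanned once against a single pattern instead of once per city, which is where the speed-up over the nested loop comes from.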
Answer 2 (score: 1)
Try this:
# YOUR DATA
##########
txt <- readLines(n = 4)
my friend stays in adarsh nagar and airoli
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
city <- readLines(n = 8)
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
# MATCHING
##########
matches <- unlist(setNames(lapply(city, grep, x = txt, fixed = TRUE),
city))
(res <- (sapply(1:length(txt), function(x)
paste0(names(matches)[matches == x], collapse = "___"))))
# [1] "adarsh nagar___airoli" ""
# [3] "bangalore" "bkc"
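A few caveats about the snippet above: fixed = TRUE does plain substring matching (no word boundaries), rows without a match come back as "" rather than "NONE", and unlist() appends a numeric suffix to a name when one city matches several lines (for example "bangalore1", "bangalore2"), so the pasted names would no longer be the bare city. A possible variant, sketched here with an invented fifth line to show the repeated-city case, keeps the fixed matching but stores the hits in a logical matrix instead:

```r
txt <- c("my friend stays in adarsh nagar and airoli",
         "I changed one apple one samsung S3 n one sony experia z.",
         "Hi girls..Friends meet at bangalore",
         "what do u think of ccd at bkc",
         "bangalore again")   # invented line: second mention of the same city
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore",
          "bangaladesh", "banerghatta Road", "bkc", "calcutta")

# One grepl() per city; rows are texts, columns are cities.
hit <- vapply(city, grepl, FUN.VALUE = logical(length(txt)), x = txt, fixed = TRUE)
# Force a matrix shape even when there is only a single text.
hit <- matrix(hit, nrow = length(txt), dimnames = list(NULL, city))

# Join the cities found in each row, or fall back to "NONE".
res <- apply(hit, 1, function(row) {
  if (any(row)) paste(city[row], collapse = "___") else "NONE"
})
res
# [1] "adarsh nagar___airoli" "NONE"                  "bangalore"
# [4] "bkc"                   "bangalore"
```

The matrix costs memory proportional to length(txt) times length(city), so for very long inputs the single-regex approach may be the better fit.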