I have a dataset containing a column of strings:
text <- c('flight cancelled','dog cat','coach travel','car bus','cow sheep',' high bar')
transport <- 0
df <- data.frame(text,transport)
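For each row, I want to return 1 if the string in text contains any of several words, and 0 otherwise. The only approach I can think of is a for loop. Is there a more efficient way? My dataset is large, so the for loop takes forever to run.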
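library(stringr)  # provides str_split
words <- 'flight|flights|plane|seats|seat|travel|time|coach'
# Check each row in turn; seq_len(nrow(df)) avoids hard-coding the row count
for (i in seq_len(nrow(df))) {
  df$transport[i] <- ifelse(any(grepl(words, str_split(as.character(df$text[i]), " "))), 1, 0)
}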
This returns:
text transport
1 flight cancelled 1
2 dog cat 0
3 coach travel 1
4 car bus 0
5 cow sheep 0
6 high bar 0
Answer 0 (score: 5)
You can use grep directly on words and df$text to find the rows that should be set to 1.
df$transport[grep(words, df$text)] <- 1
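A minimal sketch of a closely related variant, assuming the same df and words as in the question: grepl returns one logical value per row, so the whole column can be assigned in a single step without pre-filling it with 0. This uses base R only.

```r
# Rebuild the example data from the question
text <- c('flight cancelled', 'dog cat', 'coach travel', 'car bus', 'cow sheep', ' high bar')
df <- data.frame(text, transport = 0)
words <- 'flight|flights|plane|seats|seat|travel|time|coach'

# grepl is vectorized over df$text: TRUE where any alternative matches.
# as.integer turns TRUE/FALSE into 1/0.
df$transport <- as.integer(grepl(words, df$text))
print(df)
```

Unlike the index-based assignment, this overwrites every row, so rows that no longer match are reset to 0 if the code is re-run with a different pattern.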
Answer 1 (score: 2)
Here is one possibility:
df <- data.frame(text = c('flight cancelled','dog cat','coach travel','car bus','cow sheep',' high bar'),
                 transport = 0)
words <- 'flight|flights|plane|seats|seat|travel|time|coach'
# value = FALSE (the default) makes grep return row indices rather than the matched strings
df[grep(words, df$text, value = FALSE), "transport"] <- 1
text transport
1 flight cancelled 1
2 dog cat 0
3 coach travel 1
4 car bus 0
5 cow sheep 0
6 high bar 0
Answer 2 (score: 2)
You can also use the apply function:
apply(df, 1, function(x) ifelse(any(grepl(words,(str_split(as.character(x["text"]), " ")))) == TRUE,1,0))
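The apply call above depends on str_split from stringr. As a rough sketch of the same row-by-row idea in base R only, the version below uses vapply and strsplit; has_word is a hypothetical helper name introduced for illustration. It still visits each element individually, so on a large dataset the vectorized grep approach is likely to be faster.

```r
text <- c('flight cancelled', 'dog cat', 'coach travel', 'car bus', 'cow sheep', ' high bar')
df <- data.frame(text, transport = 0)
words <- 'flight|flights|plane|seats|seat|travel|time|coach'

# has_word (hypothetical helper): split one string into tokens on spaces
# and return 1L if any token matches the pattern, 0L otherwise.
has_word <- function(s) {
  tokens <- strsplit(trimws(s), " +")[[1]]
  as.integer(any(grepl(words, tokens)))
}

# vapply enforces that every call returns exactly one integer
df$transport <- vapply(as.character(df$text), has_word, integer(1), USE.NAMES = FALSE)
print(df$transport)
```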
Answer 3 (score: 2)
If you are looking for speed, stringr or stringi functions will usually outperform base functions:
library(stringr)
as.integer(str_detect(df$text, words))
[1] 1 0 1 0 0 0
Edit: As a side note, consider using word boundaries so that you do not get partial matches (for example, flight matching inside the word flights):
paste0("\\b", gsub("|", "\\b|\\b", words, fixed = TRUE), "\\b")
[1] "\\bflight\\b|\\bflights\\b|\\bplane\\b|\\bseats\\b|\\bseat\\b|\\btravel\\b|\\btime\\b|\\bcoach\\b"
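To illustrate what the word boundaries change, here is a small sketch comparing the plain and the bounded pattern. The examples vector and the names bounded, plain, and strict are made up for illustration and are not part of the original data; perl = TRUE is passed so that \b is interpreted by the PCRE engine.

```r
words <- 'flight|flights|plane|seats|seat|travel|time|coach'

# Wrap every alternative in \b ... \b, as in the expression above
bounded <- paste0("\\b", gsub("|", "\\b|\\b", words, fixed = TRUE), "\\b")

# Hypothetical test strings: two contain a keyword only as part of a longer word
examples <- c("flight cancelled", "flightless bird", "two seats left", "overtime pay")

plain  <- as.integer(grepl(words, examples))                  # substring matches count
strict <- as.integer(grepl(bounded, examples, perl = TRUE))   # whole words only

print(data.frame(examples, plain, strict))
```

With the plain pattern, "flightless bird" and "overtime pay" are flagged because they contain flight and time as substrings; with the bounded pattern they are not.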