I am trying to write a function that lets me add new elements to a data frame and then checks whether they contain certain words.
df <- data.frame(keyword=c("He drives a Honda", "He goes to Ohio State"),
                 car=c(1,0), school=c(0,1))
df
keyword                car school
He drives a Honda      1   0
He goes to Ohio State  0   1
In this data frame, car and school are binary columns: the value is 1 if a word from the car / school vector is part of the keyword, and 0 if none of the words appears in it.
car <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")
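I would like to use a function to add a new keyword to the data frame while checking that keyword against the specific values in the car and school vectors.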
main <- function(keyword){
  n = strsplit(as.character(keyword), " ")[[1]]
  for( i in keyword ){
    if( any(n==car) ){
      df$car <- c(1)
    }
    if( any(n==school )){
      df$school <- c(1)
    }
  }
}
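This function is unfinished and produces the following warning. It seems to be caused by the length of the car and school vectors (school has length 3).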
> main("He likes Ford and goes to Ohio State")
Warning message:
In n == school :
longer object length is not a multiple of shorter object length
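The warning comes from R's recycling rule: `==` compares two vectors element by element and recycles the shorter one, so it is not a membership test. The keyword has 8 words, which is not a multiple of the 3 schools, hence the message. A minimal sketch of the difference, using the vectors from the question (the variable names `words`, `car_hit` and `school_hit` are only for illustration):

```r
car    <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")

# Split the keyword into its 8 space-separated words
words <- strsplit("He likes Ford and goes to Ohio State", " ")[[1]]

# `==` recycles `car` along `words` (positions 1-4, then 1-4 again), so
# "Ford" (word 3) is compared with "Toyota" (car[3]) and never with car[4]
any(words == car)                     # FALSE

# %in% tests membership, whatever the lengths of the two vectors
car_hit <- any(words %in% car)        # TRUE

# Splitting on spaces breaks "Ohio State" into "Ohio" and "State",
# so no single word can equal the two-word school name
school_hit <- any(words %in% school)  # FALSE
```

So `%in%` fixes the car check, but multi-word terms such as "Ohio State" still need matching against the whole string rather than against individual words.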
I am also unsure how to add the 0/1 values to df. For the keyword "He likes Ford and goes to Ohio State", I should get a 1 in both the car and school columns.
keyword                               car school
He drives a Honda                     1   0
He goes to Ohio State                 0   1
He likes Ford and goes to Ohio State  1   1
Please help.
The ifelse() function seems like it would be very useful for this task, but I have not been able to implement it correctly.
Answer 0 (score: 10)
I think the simplest approach is to use a compound regular expression:
library(stringr)
car <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")
car_match <- str_c(car, collapse = "|")
school_match <- str_c(school, collapse = "|")
df <- data.frame(keyword=c("He drives a Honda",
                           "He goes to Ohio State",
                           "He likes Ford and goes to Ohio State"))
main <- function(df) {
  df$car <- str_detect(df$keyword, car_match)
  df$school <- str_detect(df$keyword, school_match)
  df
}
main(df)
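Note that str_detect() returns TRUE/FALSE rather than the 0/1 values the question asks for. A minimal base R sketch of the same idea follows, assuming the same car and school vectors. Here grepl() stands in for str_detect(), and the helper name `flag_keywords` is made up for this example:

```r
car    <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")

# Hypothetical helper: adds 0/1 columns for car and school matches
flag_keywords <- function(df) {
  # Join the terms with "|" so one pattern matches any of them
  car_match    <- paste(car, collapse = "|")
  school_match <- paste(school, collapse = "|")
  # grepl() returns logicals; as.integer() turns them into 0/1
  df$car    <- as.integer(grepl(car_match, df$keyword))
  df$school <- as.integer(grepl(school_match, df$keyword))
  df
}

df <- data.frame(keyword = c("He drives a Honda",
                             "He goes to Ohio State",
                             "He likes Ford and goes to Ohio State"))
result <- flag_keywords(df)
```

Because each term is treated as a regular expression, terms containing characters such as `.` or `(` would need escaping, and a term also matches inside longer words (for example "Ford" inside "Fordham").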
Answer 1 (score: 5)
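There are a few small problems, but they are easy to fix with a couple of %in% tests. You also need a special logical expression to handle "Ohio State", which trips up strsplit because of the space.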
df <- data.frame(keyword=c("He drives a Honda",
                           "He goes to Ohio State",
                           "He likes Ford and goes to Ohio State"),
                 car=0, school=0)
main <- function(df) {
  car <- c("Honda", "Chevy", "Toyota", "Ford")
  school <- c("Michigan", "Missouri")
  for (i in seq_len(nrow(df))) {
    Words <- strsplit(as.character(df[i, 'keyword']), " ")[[1]]
    if (any(Words %in% car)) df[i, 'car'] <- 1
    # "Ohio State" spans two words, so look for "Ohio" followed by "State";
    # na.rm = TRUE covers "Ohio" as the last word, where the next word is NA
    if (any(Words[which(Words == 'Ohio') + 1] == 'State', na.rm = TRUE)) {
      df[i, 'school'] <- 1
    }
    if (any(Words %in% school)) df[i, 'school'] <- 1
  }
  return(df)
}
main(df)
                               keyword car school
1                    He drives a Honda   1      0
2                He goes to Ohio State   0      1
3 He likes Ford and goes to Ohio State   1      1
Answer 2 (score: 4)
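Here is a version that, I believe, avoids having to manually specify every two-word search term, such as "Ohio State" in wkmor1's solution. The trick is to use grep instead: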
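main <- function(str, df) {
  # fixed = TRUE matches each term literally, including multi-word ones
  carSearch <- unlist(lapply(car, grep, x = str, fixed = TRUE))
  schoolSearch <- unlist(lapply(school, grep, x = str, fixed = TRUE))
  t1 <- length(carSearch) != 0
  t2 <- length(schoolSearch) != 0
  # Append a row only when at least one term matched
  if (t1 || t2) {
    newRow <- data.frame(keyword = str, car = ifelse(t1, 1, 0),
                         school = ifelse(t2, 1, 0))
    df <- rbind(df, newRow)
  }
  # Return df in every case, so a non-matching keyword does not yield NULL
  return(df)
}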
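This answer does not show a call, so here is a self-contained sketch of how the literal-matching idea could be used to append the keyword from the question. It is an illustration rather than the answerer's exact code: the helper name `add_keyword` is hypothetical, and grepl() with vapply() replaces the grep()/lapply() pair.

```r
car    <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")

# Hypothetical helper: append one keyword with its 0/1 flags to df
add_keyword <- function(str, df) {
  # fixed = TRUE matches each term literally, so "Ohio State" works as-is
  has_car    <- any(vapply(car, grepl, logical(1), x = str, fixed = TRUE))
  has_school <- any(vapply(school, grepl, logical(1), x = str, fixed = TRUE))
  new_row <- data.frame(keyword = str,
                        car = as.integer(has_car),
                        school = as.integer(has_school))
  rbind(df, new_row)
}

df <- data.frame(keyword = c("He drives a Honda", "He goes to Ohio State"),
                 car = c(1, 0), school = c(0, 1))
df <- add_keyword("He likes Ford and goes to Ohio State", df)
```

Unlike the function above, this sketch appends the row even when neither list matches, so such a keyword is recorded with 0 in both columns.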