Coding based on strings and the order of strings in R

Date: 2015-02-05 10:32:59

Tags: regex r string grep

I have to code many data.frames. For example:

tt <- data.frame(V1=c("test1", "test3", "test1", "test4", "wins", "loses"),
             V2=c("someannotation", "othertext", "loads of text including the word winning for the winner and the word losing for the loser", "blablabla", "blablabla", "blablabla"))

tt 
V1       V2
test1    someannotation
test3    othertext
test1    loads of text including the word winning for the winner and the word losing for the loser
test4    blablabla
wins     blablabla
loses    blablabla

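The coding has to go into a new data.frame, in which I record whether the runner wins or loses. If V1 says wins, he wins (if he loses, V1 says loses). However, the runner may also win or lose only part of the race; this is indicated by test1 in V1 and specified in V2. If the word winning appears in V2 before the word losing, the runner wins that part of the race (and vice versa).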

I tried to adapt elements of the answer from here to determine which word/string appears at which position:

find location of character in string

My implementation looks like this:

result <- data.frame()
for(i in 1:length(tt[,1])){
  if(grepl("wins", tt[i,1])) result[i,1] <- "wins"
  if(grepl("loses", tt[i,1])) result[i,1] <- "loses"
  if(grepl("test1", tt[i,1])&(which(strsplit(tt[i,2], " ")[[1]]=="winning")>which(strsplit(tt[i,2], " ")[[1]]=="losing"))) result[i,1] <- "loses"
  if(grepl("test1", tt[i,1])&(which(strsplit(tt[i,2], " ")[[1]]=="winning")<which(strsplit(tt[i,2], " ")[[1]]=="losing"))) result[i,1] <- "wins"
}

But for cells of column V2 that do not contain winning or losing, I get this error message:

Error in if (grepl("test1", tt[i, 1]) & (which(strsplit(tt[i, 2], " ")[[1]] ==  : argument is of length zero

Does anyone have a solution to this problem, even a complicated one? Any help is appreciated, thanks!

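Edit: As @grrgrrbla kindly clarified, there are two ways to win: either V1 == "wins", or V2 contains "winning" before "losing", in which case the runner also wins. Likewise, there are two ways to lose: either V1 == "loses", or V2 contains "losing" before "winning".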

My output should look like this:

result
  V1
  NA
  NA
  wins
  NA
  wins
  loses

1 answer:

Answer 0 (score: 0)

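You could try (probably not the simplest solution...) creating a function that returns "wins" if your "winning" condition is met, "loses" if your "losing" condition is met, and NA in all other cases: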

wilo <- function(vec) {
    # if the first variable is "wins" or "loses", return the value of the first variable
    if (grepl("wins|loses", vec[1])) {
        return(vec[1])
    }
    # position of the first occurrence of each word in the second variable (-1 if absent)
    win_pos <- gregexpr("winning", vec[2])[[1]][1]
    lose_pos <- gregexpr("losing", vec[2])[[1]][1]
    if (win_pos > 0 && lose_pos > 0) {
        # both words are present: "wins" if "winning" is placed before "losing", else "loses"
        if (win_pos < lose_pos) "wins" else "loses"
    } else {
        NA # if none of the conditions are fulfilled, return NA
    }
}

You can then apply the function to each row of the data.frame:

apply(tt,1,wilo)
#[1] NA      NA      "wins"  NA      "wins"  "loses"
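
If you would rather avoid the row-wise apply, the same rules can probably be written in a vectorised way. This is only a sketch under a few assumptions of mine: the columns are stored as character (hence stringsAsFactors = FALSE), only rows with V1 == "test1" are checked for word order (as the question describes), and the names win_pos, lose_pos and both are just illustrative.

```r
# Same toy data as in the question; stringsAsFactors = FALSE keeps both columns as character
tt <- data.frame(V1 = c("test1", "test3", "test1", "test4", "wins", "loses"),
                 V2 = c("someannotation", "othertext",
                        "loads of text including the word winning for the winner and the word losing for the loser",
                        "blablabla", "blablabla", "blablabla"),
                 stringsAsFactors = FALSE)

# Position of the first match in every row at once (-1 where the word is absent)
win_pos  <- as.integer(regexpr("winning", tt$V2, fixed = TRUE))
lose_pos <- as.integer(regexpr("losing", tt$V2, fixed = TRUE))
both     <- win_pos > 0 & lose_pos > 0  # TRUE only if both words occur

# Start with NA everywhere, then fill in the rows that satisfy a rule
result <- data.frame(V1 = rep(NA_character_, nrow(tt)), stringsAsFactors = FALSE)
direct <- tt$V1 %in% c("wins", "loses")
result$V1[direct] <- tt$V1[direct]
result$V1[tt$V1 == "test1" & both & win_pos < lose_pos] <- "wins"
result$V1[tt$V1 == "test1" & both & win_pos > lose_pos] <- "loses"

result
```

Because rows where either word is missing are excluded through both, no comparison ever runs on an empty vector, which is what triggered the "argument is of length zero" error in the loop.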

Note: As @grrgrrbla suggested, an alternative to using the function gregexpr is to use the function str_locate from the stringr package.
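
A rough sketch of what that alternative might look like, assuming the stringr package is installed. The helper name wilo_stringr and its two-argument form are my own invention for illustration. str_locate returns a matrix with start and end columns, and NA where the pattern is not found.

```r
library(stringr)  # assumption: stringr is installed

# Hypothetical helper: v1 is the value from column V1, v2 the text from column V2
wilo_stringr <- function(v1, v2) {
  if (v1 %in% c("wins", "loses")) return(v1)
  # Start position of the first match, or NA if the word is absent
  win_pos  <- str_locate(v2, fixed("winning"))[1, "start"]
  lose_pos <- str_locate(v2, fixed("losing"))[1, "start"]
  if (is.na(win_pos) || is.na(lose_pos)) return(NA_character_)
  if (win_pos < lose_pos) "wins" else "loses"
}

wilo_stringr("test1", "the word winning comes before the word losing")
wilo_stringr("test4", "blablabla")
```

Since a missing word yields NA rather than -1, the absence check becomes a plain is.na() test.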