How do I compare two adjacent strings in a column and loop over all the strings?

Time: 2014-07-26 15:25:21

Tags: r loops dataframe lapply

This is the function I use to find the differences between two strings (found on the R-help website):

X  <- "abcdefg" ; Y <- "aBcDEfg" 
diff <- function(X,Y){
  X0 <- unlist(strsplit(X,split=""))  ## Nasty but necessary! 
  Y0 <- unlist(strsplit(Y,split=""))  ## ... 
  ix <- which(X0 != Y0) 
  cbind(ix,X0[ix],Y0[ix])   
}
diff(X,Y)
     ix         
[1,] "2" "b" "B"
[2,] "4" "d" "D"
[3,] "5" "e" "E"

I need to compare the values of a column in the data frame status:

grint <- 
c("45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", 
"45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "<5CCBC:4B", 
"<5CCBC:4B", "<5CCBC:4B", "<<CCBC::B", "<<GGBG::E", "<<GGBG::E", 
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", 
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", 
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", 
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", 
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", 
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "CC11B1CCE", 
"CC11B1CCE", "CC55B1CCE", "55CCBC44B", "55CCBC44B", "55CCBC44B", 
"55CCBC44B", "55CCBC44B", "55CCBC44B", "G1CCBC1GB", "G1CCBC1GB", 
"G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", 
"G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", 
"G1CCBC1GB", "G1CCBC1GB", "91CCBC11B", "01CCBC11B", "01CCBC11B", 
"01CCBC11B", "01CCBC11B", "11CCBC11B", "11CCBC11B", "11CCBC11B", 
"15CCBC11B", "15CCBC11B", "15CCBC11B", "15CCBC11B", "15CCBC11B", 
"15CCBC11B", "15CCBC11B", "15CCBC11B", "15CCBC11B", "15CCBC11B", 
"55CCBC11B", "55CCBC11B", "55CCBC41B", "55CCBC41B", "55CCBC41B", 
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", 
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B"
)

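I need to compare consecutive values in the column, that is, find where a change occurs and compare the two strings involved. For example, grint[9] and grint[10] differ, and that difference should be shown. I tried to use lapply to loop over every string and find each change, but it failed: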

a <-grint[i]
b <-grint[i+1]

lapply(grint,diff(a,b))

Error:

Error in match.fun(FUN) : 
  'diff(a, b)' is not a function, character or symbol

So I would like to know how I should do this. Thanks a lot!

3 answers:

Answer 0: (score: 1)

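I think you need match, which returns the index of the first match of each value. Here x is your vector of strings. Drop the first element, since the first string has nothing before it to compare with: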

> ( m <- match(unique(x), x)[-1] )
 [1] 10 13 14 16 45 47 48 54 68 69 73 76 86

Comparing each matched element with the one before it shows where the differences are.

> cbind(x[m-1], x[m])
      [,1]        [,2]       
 [1,] "45CCBC44B" "<5CCBC:4B"
 [2,] "<5CCBC:4B" "<<CCBC::B"
 [3,] "<<CCBC::B" "<<GGBG::E"
 [4,] "<<GGBG::E" "55CCBC41B"
 [5,] "55CCBC41B" "CC11B1CCE"
 [6,] "CC11B1CCE" "CC55B1CCE"
 [7,] "CC55B1CCE" "55CCBC44B"
 [8,] "55CCBC44B" "G1CCBC1GB"
 [9,] "G1CCBC1GB" "91CCBC11B"
[10,] "91CCBC11B" "01CCBC11B"
[11,] "01CCBC11B" "11CCBC11B"
[12,] "11CCBC11B" "15CCBC11B"
[13,] "15CCBC11B" "55CCBC11B"
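
One caveat: match(unique(x), x) reports only the first occurrence of each distinct value, so a switch back to a value seen earlier is not reported (in the question's data, 55CCBC41B comes back at position 88, after 55CCBC11B). A minimal sketch of an alternative that compares every element with its predecessor directly; the toy vector x and the name change_at are illustrative only:

```r
# Toy data: "a" reappears after "b", which match(unique(x), x) would not report
x <- c("a", "a", "b", "b", "a", "c")

# Positions whose value differs from the element immediately before
change_at <- which(x[-1] != x[-length(x)]) + 1

# Pair each changed element with its predecessor
cbind(before = x[change_at - 1], after = x[change_at])
```

The same two lines should work unchanged on grint, since they rely only on the vector having at least two elements.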

Answer 1: (score: 0)

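I'm not entirely sure I understand the question. Are you trying to find where in a column/variable a change occurs? If so, you can do it like this.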

  • Convert your column to a character vector.

Here I have taken your first 17 entries and put them into the vector 'x' by hand:

x <- c("45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B",
       "45CCBC44B", "45CCBC44B", "45CCBC44B", "<5CCBC:4B", "<5CCBC:4B", "<5CCBC:4B",
       "<<CCBC::B", "<<GGBG::E", "<<GGBG::E", "55CCBC41B", "55CCBC41B")

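Then you can simply ask whether each element of that vector is the same as the element before it, by comparing the vector against a lagged copy of itself: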

lagged.x <- c(NA,head(x,-1))
x == lagged.x


[1]    NA  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE   TRUE FALSE  TRUE

Wherever there is a difference, it is flagged as FALSE. Is that what you are after?
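
If positions are more useful than a logical vector, which can extract them. A small sketch building on the same lagged comparison; a shortened x is used so the block runs on its own, and the name changed is illustrative:

```r
x <- c("45CCBC44B", "45CCBC44B", "<5CCBC:4B", "<5CCBC:4B", "<<CCBC::B")
lagged.x <- c(NA, head(x, -1))

# which() skips the NA in the first slot and returns the positions that are TRUE
changed <- which(x != lagged.x)
changed

# Show each changed string next to the one before it
data.frame(position = changed, previous = x[changed - 1], current = x[changed])
```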

Answer 2: (score: 0)

Better than my answer below: just do what @Andrie suggested in the comments, diff(grint[-1], grint[-length(grint)])
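
For reference, the lapply call in the question fails because lapply expects a function as its second argument, whereas diff(a, b) is an already-evaluated call. If one result per adjacent pair is preferred over a single flat index, a sketch using Map might look like this. It assumes all strings have the same length; char_diff is simply a renamed copy of the question's function (so that base::diff is not masked), and the short grint is sample data:

```r
# Same logic as the question's diff(), renamed to avoid masking base::diff
char_diff <- function(X, Y) {
  X0 <- unlist(strsplit(X, split = ""))
  Y0 <- unlist(strsplit(Y, split = ""))
  ix <- which(X0 != Y0)
  cbind(ix, X0[ix], Y0[ix])
}

grint <- c("45CCBC44B", "45CCBC44B", "<5CCBC:4B", "<<CCBC::B")

# Map() calls char_diff on (grint[1], grint[2]), (grint[2], grint[3]), ...
res <- Map(char_diff, grint[-length(grint)], grint[-1])
names(res) <- paste0(seq_along(res), "-", seq_along(res) + 1)

# Keep only the pairs that actually differ
res[vapply(res, nrow, integer(1)) > 0]
```

Each remaining list element is named after the pair of positions it compares, so the character positions inside it are relative to a single string.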

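Here are two slightly different approaches that can handle strings of different lengths. If all of your strings have the same length, you don't need str_pad from stringr.
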
samplestrings <- c("apple", "apple", "banana", "banana", "apple", "apple","aslkd;fa")
library(stringr)
samplestrings <- str_pad(samplestrings, max(nchar(samplestrings)) , side="right")

  X0 <- unlist(strsplit(samplestrings,split=""))  ## Nasty but necessary!
  Y0 <- unlist(strsplit(c(samplestrings[-1], rep(" ", max(nchar(samplestrings)))),split="")) ## ...
  ix <- which(X0[-length(X0):-(length(X0)-max(nchar(samplestrings))+1)] != 
              Y0[-length(X0):-(length(X0)-max(nchar(samplestrings))+1)])
  cbind(ix,X0[ix],Y0[ix])

      ix          
 [1,] "9"  "a" "b"
 [2,] "10" "p" "a"
 [3,] "11" "p" "n"
 [4,] "12" "l" "a"
 [5,] "13" "e" "n"
 [6,] "14" " " "a"
 [7,] "25" "b" "a"
 [8,] "26" "a" "p"
 [9,] "27" "n" "p"
[10,] "28" "a" "l"
[11,] "29" "n" "e"
[12,] "30" "a" " "
[13,] "42" "p" "s"
[14,] "43" "p" "l"
[15,] "44" "l" "k"
[16,] "45" "e" "d"
[17,] "46" " " ";"
[18,] "47" " " "f"
[19,] "48" " " "a"
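
Note that the ix values above index into the flattened character vector, not into individual strings. Because every string was padded to the same width, integer arithmetic recovers which string and which character each index refers to. A sketch, assuming the width of 8 from the padded sample data and a few ix values copied from the output above; string_no and char_pos are illustrative names:

```r
width <- 8                      # max(nchar(samplestrings)) after padding
ix <- c(9, 14, 25, 42, 48)      # a few flat indices from the output above

string_no <- (ix - 1) %/% width + 1   # which string the character belongs to
char_pos  <- (ix - 1) %% width + 1    # position of the character inside it

data.frame(ix, string_no, char_pos)
```

Here string_no is the first string of each compared pair, so a value of 2 means the difference is between strings 2 and 3.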

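Method 2: I wrote this before I realised what kind of output the OP was looking for, but it can still be useful if you want to build a data frame of the non-matching pairs on the way to finding the character differences between consecutive strings.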

samplestrings <- c("apple", "apple", "banana", "banana", "apple", "apple","aslkd;fa")
library(stringr) 
# use str_pad to make every string equal in number of characters
samplestrings <- str_pad(samplestrings, max(nchar(samplestrings)) , side="right")

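# run-length encode the vector: each run is a stretch of identical strings
findiffs <- rle(samplestrings)
# index of the last element of each run; the element after it starts a new run
ends <- cumsum(findiffs$lengths)

newdf <- data.frame(index = paste0(ends, "-", ends + 1),
          firststring = samplestrings[ends],
          secondstring = samplestrings[ends + 1])

# the last run has no successor, so drop the final row
newdf <- newdf[-nrow(newdf), ]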

  index firststring secondstring
1   2-3    apple        banana  
2   4-5    banana       apple   
3   6-7    apple        aslkd;fa

So newdf holds the pairs of strings that differ, and we can then apply the same approach you used:

  X0 <- unlist(strsplit(as.character(newdf$firststring),split=""))  ## Nasty but necessary!
  Y0 <- unlist(strsplit(as.character(newdf$secondstring),split=""))  ## ...
  ix <- which(X0 != Y0)
  cbind(ix,X0[ix],Y0[ix]) 

     ix          
 [1,] "1"  "a" "b"
 [2,] "2"  "p" "a"
 [3,] "3"  "p" "n"
 [4,] "4"  "l" "a"
 [5,] "5"  "e" "n"
 [6,] "6"  " " "a"
 [7,] "9"  "b" "a"
 [8,] "10" "a" "p"
 [9,] "11" "n" "p"
[10,] "12" "a" "l"
[11,] "13" "n" "e"
[12,] "14" "a" " "
[13,] "18" "p" "s"
[14,] "19" "p" "l"
[15,] "20" "l" "k"
[16,] "21" "e" "d"
[17,] "22" " " ";"
[18,] "23" " " "f"
[19,] "24" " " "a"
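
Both methods above use stringr only for right-padding. If you would rather avoid the dependency, base R's formatC can pad as well. A minimal sketch under that assumption, with padded as an illustrative name and a shortened sample vector:

```r
samplestrings <- c("apple", "apple", "banana", "aslkd;fa")

# flag = "-" left-justifies each string, i.e. pads with spaces on the right
padded <- formatC(samplestrings, width = max(nchar(samplestrings)), flag = "-")

nchar(padded)
```

After this step every string has the same number of characters, which is what the flattened comparison in both methods depends on.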