This is the function I use to find the differences between two strings (found on the R-help website):
X <- "abcdefg" ; Y <- "aBcDEfg"
diff <- function(X,Y){
X0 <- unlist(strsplit(X,split="")) ## Nasty but necessary!
Y0 <- unlist(strsplit(Y,split="")) ## ...
ix <- which(X0 != Y0)
cbind(ix,X0[ix],Y0[ix])
}
diff(X,Y)
ix
[1,] "2" "b" "B"
[2,] "4" "d" "D"
[3,] "5" "e" "E"
I need to compare the values in a column of my data frame:
grint <-
c("45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B",
"45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "<5CCBC:4B",
"<5CCBC:4B", "<5CCBC:4B", "<<CCBC::B", "<<GGBG::E", "<<GGBG::E",
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B",
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B",
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B",
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B",
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B",
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "CC11B1CCE",
"CC11B1CCE", "CC55B1CCE", "55CCBC44B", "55CCBC44B", "55CCBC44B",
"55CCBC44B", "55CCBC44B", "55CCBC44B", "G1CCBC1GB", "G1CCBC1GB",
"G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB",
"G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB", "G1CCBC1GB",
"G1CCBC1GB", "G1CCBC1GB", "91CCBC11B", "01CCBC11B", "01CCBC11B",
"01CCBC11B", "01CCBC11B", "11CCBC11B", "11CCBC11B", "11CCBC11B",
"15CCBC11B", "15CCBC11B", "15CCBC11B", "15CCBC11B", "15CCBC11B",
"15CCBC11B", "15CCBC11B", "15CCBC11B", "15CCBC11B", "15CCBC11B",
"55CCBC11B", "55CCBC11B", "55CCBC41B", "55CCBC41B", "55CCBC41B",
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B",
"55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B", "55CCBC41B"
)
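I need to compare consecutive values in the column, that is, find where a change occurs and then compare the two strings. For example, grint[9] and grint[10] differ, and that difference should be shown.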
I tried to use lapply to loop over the strings and find each change, but it failed:
a <-grint[i]
b <-grint[i+1]
lapply(grint,diff(a,b))
The error:
Error in match.fun(FUN) :
'diff(a, b)' is not a function, character or symbol
So I'd like to know how I should do this. Thanks a lot!
Answer 0: (score: 1)
I think you need match, which returns the index of the first match. Drop the first element:
> ( m <- match(unique(x), x)[-1] )
[1] 10 13 14 16 45 47 48 54 68 69 73 76 86
Comparing each match with the element before it shows where the differences are.
> cbind(x[m-1], x[m])
[,1] [,2]
[1,] "45CCBC44B" "<5CCBC:4B"
[2,] "<5CCBC:4B" "<<CCBC::B"
[3,] "<<CCBC::B" "<<GGBG::E"
[4,] "<<GGBG::E" "55CCBC41B"
[5,] "55CCBC41B" "CC11B1CCE"
[6,] "CC11B1CCE" "CC55B1CCE"
[7,] "CC55B1CCE" "55CCBC44B"
[8,] "55CCBC44B" "G1CCBC1GB"
[9,] "G1CCBC1GB" "91CCBC11B"
[10,] "91CCBC11B" "01CCBC11B"
[11,] "01CCBC11B" "11CCBC11B"
[12,] "11CCBC11B" "15CCBC11B"
[13,] "15CCBC11B" "55CCBC11B"
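One caveat: match(unique(x), x) only reports the first occurrence of each distinct value, so a change back to a value seen earlier is not listed (if I read the posted data correctly, grint[88] returns to "55CCBC41B" after "55CCBC11B"). A minimal sketch of comparing every element with its predecessor instead, using a small made-up vector rather than the OP's data:

```r
# Toy vector (hypothetical, not the OP's data): "A" reappears at position 5
x <- c("A", "A", "B", "B", "A", "C")

# First-occurrence approach: misses the change back to "A"
m_first <- match(unique(x), x)[-1]

# Compare every element with the one before it; +1 gives the index
# of the element that differs from its predecessor
m_all <- which(x[-1] != x[-length(x)]) + 1

m_first                        # 3 6
m_all                          # 3 5 6
cbind(x[m_all - 1], x[m_all])  # previous value, new value
```

Both approaches agree whenever no value ever comes back after a change.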
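Answer 1: (score: 0)
I'm not entirely sure I understand the question. If you're trying to find where the values in a column/variable change, you could do this.
Here I've taken your first 17 entries and put them into a vector 'x' by hand: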
x<-c("45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "45CCBC44B", "<5CCBC:4B", "<5CCBC:4B", "<5CCBC:4B", "<<CCBC::B", "<<GGBG::E", "<<GGBG::E", "55CCBC41B", "55CCBC41B")
Then you can simply ask whether each element of that vector is the same as the one before it, using a lagged comparison:
lagged.x <- c(NA,head(x,-1))
x == lagged.x
[1] NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
Wherever there is a difference, it is flagged as FALSE. Is that what you're after?
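If positions are more convenient than a logical vector, the lagged comparison can be passed through which(). This is only a sketch on a made-up vector; the column names in the data frame are my own choice:

```r
# Toy vector (hypothetical)
x <- c("a", "a", "b", "b", "c")

lagged.x <- c(NA, head(x, -1))

# which() ignores the leading NA and returns the positions that differ
changed <- which(x != lagged.x)
changed  # 3 5

# Show each changed position next to the value before it
changes <- data.frame(position = changed,
                      previous = x[changed - 1],
                      current  = x[changed])
changes
```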
Answer 2: (score: 0)
Better than the answer below: just do what @Andrie suggested in the comments, diff(grint[-1], grint[-length(grint)]).
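With whole vectors as arguments, the OP's function reports positions in the flattened character vector rather than per pair. If a separate result for each changed pair is preferred, something like the following might do. It also shows what lapply expected in the question: a function as its second argument, not the result of calling one. The name char_diff and the toy data are assumptions for illustration only:

```r
# Same logic as the OP's function, renamed (hypothetical name) so that
# base::diff is not masked
char_diff <- function(X, Y) {
  X0 <- unlist(strsplit(X, split = ""))
  Y0 <- unlist(strsplit(Y, split = ""))
  ix <- which(X0 != Y0)
  cbind(ix, X0[ix], Y0[ix])
}

# Toy data (hypothetical); all strings have the same length
s <- c("abc", "abc", "aBc", "aBC")

# Positions i where s[i] and s[i + 1] differ
i <- which(s[-1] != s[-length(s)])

# Pass lapply() an anonymous function that compares one pair at a time
res <- lapply(i, function(k) char_diff(s[k], s[k + 1]))
names(res) <- paste0(i, "-", i + 1)
res
```

Each list element is named after the pair of positions it compares, for example "2-3".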
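Here are two slightly different approaches that can handle strings of different lengths. If all the strings have the same length, you don't need str_pad from stringr.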
samplestrings <- c("apple", "apple", "banana", "banana", "apple", "apple","aslkd;fa")
library(stringr)
samplestrings <- str_pad(samplestrings, max(nchar(samplestrings)) , side="right")
X0 <- unlist(strsplit(samplestrings,split="")) ## Nasty but necessary!
Y0 <- unlist(strsplit(c(samplestrings[-1], rep(" ", max(nchar(samplestrings)))),split="")) ## ...
ix <- which(X0[-length(X0):-(length(X0)-max(nchar(samplestrings))+1)] !=
Y0[-length(X0):-(length(X0)-max(nchar(samplestrings))+1)])
cbind(ix,X0[ix],Y0[ix])
ix
[1,] "9" "a" "b"
[2,] "10" "p" "a"
[3,] "11" "p" "n"
[4,] "12" "l" "a"
[5,] "13" "e" "n"
[6,] "14" " " "a"
[7,] "25" "b" "a"
[8,] "26" "a" "p"
[9,] "27" "n" "p"
[10,] "28" "a" "l"
[11,] "29" "n" "e"
[12,] "30" "a" " "
[13,] "42" "p" "s"
[14,] "43" "p" "l"
[15,] "44" "l" "k"
[16,] "45" "e" "d"
[17,] "46" " " ";"
[18,] "47" " " "f"
[19,] "48" " " "a"
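Method 2: I wrote this before I realized what kind of output the OP was looking for, but it may still be useful if you want to build a data frame of the mismatched pairs on the way to finding the character differences between consecutive strings.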
samplestrings <- c("apple", "apple", "banana", "banana", "apple", "apple","aslkd;fa")
library(stringr)
# use str_pad to make every string equal in number of characters
samplestrings <- str_pad(samplestrings, max(nchar(samplestrings)) , side="right")
findiffs <- rle(samplestrings)
newdf <- data.frame(index = paste0(cumsum(findiffs$lengths),"-",cumsum(findiffs$lengths)+1),
firststring = samplestrings[cumsum(findiffs$lengths)],
secondstring = samplestrings[cumsum(findiffs$lengths)+1])
newdf <- newdf[-dim(newdf)[1],]
index firststring secondstring
1 2-3 apple banana
2 4-5 banana apple
3 6-7 apple aslkd;fa
So newdf contains the pairs of strings that are not identical, and we can then apply the method you used:
X0 <- unlist(strsplit(as.character(newdf$firststring),split="")) ## Nasty but necessary!
Y0 <- unlist(strsplit(as.character(newdf$secondstring),split="")) ## ...
ix <- which(X0 != Y0)
cbind(ix,X0[ix],Y0[ix])
ix
[1,] "1" "a" "b"
[2,] "2" "p" "a"
[3,] "3" "p" "n"
[4,] "4" "l" "a"
[5,] "5" "e" "n"
[6,] "6" " " "a"
[7,] "9" "b" "a"
[8,] "10" "a" "p"
[9,] "11" "n" "p"
[10,] "12" "a" "l"
[11,] "13" "n" "e"
[12,] "14" "a" " "
[13,] "18" "p" "s"
[14,] "19" "p" "l"
[15,] "20" "l" "k"
[16,] "21" "e" "d"
[17,] "22" " " ";"
[18,] "23" " " "f"
[19,] "24" " " "a"
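The ix values above index the flattened character vector, not a position within each string. Because every string was padded to the same width (8 in this example), the pair and the character position can be recovered with integer division. A small sketch using a few of the indices printed above; the variable names are my own:

```r
width <- 8                    # padded string width in the example above
ix <- c(1, 6, 9, 14, 18, 24)  # a few of the flat indices printed above

located <- data.frame(
  ix   = ix,
  pair = (ix - 1) %/% width + 1,  # which row of newdf the index falls in
  pos  = (ix - 1) %% width + 1    # character position within that string
)
located
```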