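After modifying the Ax element in some of the strings, I want to find the indices of the duplicated strings in a very long vector of strings, as explained below. Consider: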
D <- data.frame(string=c("A 4 B 15 C 7","A 13 B 17 C 7","A 3 Ax 1 B 15 C 7","A 12 Ax 1 B 17 C 7","A 24 B 15 C 4","A 32 B 13 C 10","A 12 Ax 1 B 24 D 1","A 12 Ax 1 B 24 D 1","A 13 B 24 D 1"))
# string
"A 4 B 15 C 7"
"A 13 B 17 C 7"
"A 3 Ax 1 B 15 C 7"
"A 12 Ax 1 B 17 C 7"
"A 24 B 15 C 4"
"A 32 B 13 C 10"
"A 12 Ax 1 B 24 D 1"
"A 12 Ax 1 B 24 D 1"
"A 13 B 24 D 1"
I now increase each A by its Ax and remove the Ax part, which gives me duplicates:
l <- strsplit(as.character(D$string), ' ')
# check which list parts contain 'Ax'
i <- sapply(l, function(v) any(v == 'Ax'))
# for those that contain 'Ax' increase the second number with 1
# and remove the 'Ax 1' part
l[i] <- lapply(l[i], function(v) {
v[2] <- as.character(as.numeric(v[2]) + 1);
v[-c(which(v == 'Ax') + 0:1)]
})
# check which are duplicates
k<-data.frame(k=as.integer(duplicated(l)))
k1<-data.frame(k=as.integer(duplicated(l,fromLast = TRUE)))
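That part was solved here: Finding isotopes of find corresponding string with two differences exactly predifined
But how do I now check at which positions in the original data frame D I have an Ax value and the matching duplicate for A? My idea is as follows, where h flags the rows of D that contain Ax: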
h<-c(0,0,1,1,0,0,1,1,0)
inds <- lapply(1:length(h[h==1 & (k==1 | k1==1)]), function(x) which(paste0(l[h==1 & (k==1 | k1==1)], collapse = NULL) %in% as.vector(l[h==1 & (k==1 | k1==1)][x])))
inds<-unlist(inds)
inds:
1
2
3
3
I can check whether inds is correct by plugging it into the original data.frame:
X<-data.frame(A=D[h==0 & (k==1 | k1==1),1][inds],Ax=D[h==1 & (k==1 | k1==1),1])
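Here the first column holds the A values and the second column holds the corresponding Ax values.
Apart from taking a lot of time, this does not always seem to give the correct indices. And what happens when inds has multiple matches, where it fails again?
Does anyone know how to improve this, make it correct, and handle multiple matches? In the end I would like something like the inds vector (or a list, if a row has multiple matches), so that I know, whenever there are duplicates, where my A strings and their corresponding Ax strings sit in the original data.frame D.
Any other approach for finding the indices of the A strings and their corresponding Ax strings is also welcome.
Can anyone help me?
Many thanks.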
Answer 0 (score: 0)
This is a long and inefficient (easily improved) way of solving the problem.
library(dplyr)
library(tidyr)
library(stringr)

D%>%
mutate(string = as.character(string),spaces = str_count(string,' ')+1)%>%
magrittr::set_colnames(c("strings", "spaces"))%>%
separate(col = strings, into = paste("col", 1:max(.$spaces), sep = ""))%>%
mutate(col2 = as.numeric(col2), col4 = as.numeric(col4))%>%
mutate(col2 = ifelse(col4 == 1 & col3 == "Ax", col2+col4, col2),
col4 = ifelse(col3 == "Ax", "", col4))%>%
mutate_all(~ replace(., is.na(.)|. == "Ax", NA))%>%
select(-spaces)%>%
unite(col = "D",colnames(.), sep = " ")%>%
mutate(D = gsub(" NA |NA", "", D))
D
1 A 4 B 15 C 7
2 A 13 B 17 C 7
3 A 4 B 15 C 7
4 A 13 B 17 C 7
5 A 24 B 15 C 4
6 A 32 B 13 C 10
7 A 13 B 24 D 1
8 A 13 B 24 D 1
9 A 13 B 24 D 1
To get the indices of the "duplicated" strings, simply use duplicated(), which returns a vector of TRUE / FALSE values, and then which() to turn those into indices:
data%>%
duplicated()%>%
which()
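One detail worth illustrating: duplicated() only flags the second and later occurrences of a value, so the first member of each duplicate group is not reported. A minimal sketch on a made-up toy vector (x is hypothetical, not part of the question's data) shows how combining it with fromLast = TRUE, as the question itself does with k and k1, flags every member:

```r
# Toy vector standing in for the normalised strings (hypothetical data)
x <- c("a", "b", "a", "c", "b", "b")

# duplicated() marks only the second and later occurrences
later <- which(duplicated(x))

# OR-ing with a scan from the end also marks the first occurrence
all_members <- which(duplicated(x) | duplicated(x, fromLast = TRUE))

print(later)        # 3 5 6
print(all_members)  # 1 2 3 5 6
```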
So your duplicates are found as follows:
D%>%
mutate(string = as.character(string),spaces = str_count(string,' ')+1)%>%
magrittr::set_colnames(c("strings", "spaces"))%>%
separate(col = strings, into = paste("col", 1:max(.$spaces), sep = ""))%>%
mutate(col2 = as.numeric(col2), col4 = as.numeric(col4))%>%
mutate(col2 = ifelse(col4 == 1 & col3 == "Ax", col2+col4, col2),
col4 = ifelse(col3 == "Ax", "", col4))%>%
mutate_all(~ replace(., is.na(.)|. == "Ax", NA))%>%
select(-spaces)%>%
unite(col = "D",colnames(.), sep = " ")%>%
mutate(D = gsub(" NA |NA", "", D))%>%
duplicated()%>%
which()
[1] 3 4 8 9
I know this process looks rather tedious, so I will see whether I can condense it later.
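The indices above say which rows are duplicates, but not which plain A row belongs to which Ax row. One possible way to get that mapping, including multiple matches, is to group the row numbers by their normalised string. The following is only a base R sketch under some assumptions: the names parts, has_ax, normalise, key, plain_by_key and matches are made up for illustration, and the Ax amount is added to the A amount (which equals adding 1 for the sample data).

```r
D <- data.frame(string = c("A 4 B 15 C 7", "A 13 B 17 C 7", "A 3 Ax 1 B 15 C 7",
                           "A 12 Ax 1 B 17 C 7", "A 24 B 15 C 4", "A 32 B 13 C 10",
                           "A 12 Ax 1 B 24 D 1", "A 12 Ax 1 B 24 D 1", "A 13 B 24 D 1"),
                stringsAsFactors = FALSE)

parts  <- strsplit(D$string, " ", fixed = TRUE)
has_ax <- vapply(parts, function(v) "Ax" %in% v, logical(1))

# Assumption: fold the Ax amount into the A amount, then drop the "Ax n" pair
normalise <- function(v) {
  p <- match("Ax", v)
  if (is.na(p)) return(paste(v, collapse = " "))
  v[2] <- as.character(as.numeric(v[2]) + as.numeric(v[p + 1]))
  paste(v[-c(p, p + 1)], collapse = " ")
}
key <- vapply(parts, normalise, character(1))

# Row numbers of the plain (non-Ax) strings, grouped by normalised string
plain_by_key <- split(which(!has_ax), key[!has_ax])

# For every Ax row, the plain rows sharing its normalised string
# (NULL if there is none, a longer vector if there are several)
ax_rows <- which(has_ax)
matches <- lapply(key[ax_rows], function(k) plain_by_key[[k]])
names(matches) <- ax_rows

print(matches)
```

For the sample data this pairs Ax rows 3, 4, 7 and 8 with plain rows 1, 2, 9 and 9. Because matches is a list, an Ax row with several plain counterparts would simply get a longer vector, and the strings are only split and normalised once rather than once per comparison.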