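Please bear with me: I come from a Python background and I'm still learning string manipulation in R.
OK, let's say I have a string of length 100 made up of the letters A, B, C, or D in random order: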
> df<-c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")
> df
[1]"ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD"
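I want to do the following two things:
1) Generate a '.txt' file made up of the 20-character substrings of the string above, each one starting one letter after the previous one, with its own unique name on the line above it, like this: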
NAME1
ABCBDBDBCBABABDBCBCB
NAME2
BCBDBDBCBABABDBCBCBD
NAME3
CBDBDBCBABABDBCBCBDB
NAME4
BDBDBCBABABDBCBCBDBD
... and so on
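2) Take the generated list and build from it another list of the same substrings, the only difference being that one or two of the letters A, B, C, or D are changed to another of A, B, C, or D (any of these four letters only).
So this: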
NAME1
ABCBDBDBCBABABDBCBCB
would become this:
NAME1.1
ABBBDBDBCBDBABDBCBCB
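As you can see, the "C" in the third position became a "B" and the "A" in the 11th position became a "D"; there is no implied relationship between the changed letters. It is purely arbitrary.
I know this is a convoluted question, but as I said, I'm still learning basic text and string manipulation in R.
Thanks in advance.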
Answer 0 (score: 4)
I've tried to break this down into several simple steps; hopefully you can pick up a few tricks from it:
# Random data
df<-c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")
n<-10 # Number of cuts
set.seed(1)
# Pick n random start positions between 1 and the last valid start (string length - 20 + 1)
nums<-sample(1:(nchar(df)-20+1),n,replace=TRUE)
# Make your cuts
cuts<-sapply(nums,function(x) substring(df,x,x+20-1))
# Generate some names
nams<-paste0('NAME',1:n)
# Make it into a matrix, transpose, and then recast into a vector to get alternating names and cuts.
names.and.cuts<-c(t(matrix(c(nams,cuts),ncol=2)))
# Drop a file.
write.table(names.and.cuts,'file.txt',quote=FALSE,row.names=FALSE,col.names = FALSE)
# Pick how many changes are going to be made to each cut.
changes<-sample(1:2,n,replace=TRUE)
# Pick that number of positions to change
pos.changes<-lapply(changes,function(x) sample(1:20,x))
# Find the letter at each position (within each cut, not within the full string).
letter.at.change.pos<-mapply(function(cut,pos) substring(cut,pos,pos),cuts,pos.changes,SIMPLIFY=FALSE,USE.NAMES=FALSE)
# Make a function that takes any letter, and outputs any other letter from c(A-D)
letter.map<-function(x){
  # Make a list of alternate letters.
  alternates<-lapply(x,setdiff,x=c('A','B','C','D'))
  # Pick one of each
  sapply(alternates,sample,size=1)
}
# Find another letter for each
letter.changes<-lapply(letter.at.change.pos,letter.map)
# Make a function to replace character by position
# Inefficient, but who cares.
rep.by.char<-function(str,pos,chars){
  for (i in seq_along(pos)) substr(str,pos[i],pos[i])<-chars[i]
  str
}
# Change every letter at pos.changes to letter.changes
mod.cuts<-mapply(rep.by.char,cuts,pos.changes,letter.changes,USE.NAMES=FALSE)
# Generate names
nams<-paste0(nams,'.1')
# Use the matrix trick to alternate names. Drop a file.
names.and.mod.cuts<-c(t(matrix(c(nams,mod.cuts),ncol=2)))
write.table(names.and.mod.cuts,'file2.txt',quote=FALSE,row.names=FALSE,col.names = FALSE)
Also, instead of the rep.by.char function, you could simply use strsplit and replace, like this:
mod.cuts<-mapply(function(x,y,z) paste(replace(x,y,z),collapse=''),
strsplit(cuts,''),pos.changes,letter.changes,USE.NAMES=FALSE)
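Note that the code above samples random start positions. If you want every consecutive window, as in the question, a minimal sketch could look like the following. It is not part of the original answer; `width`, `starts`, and the temporary output path are illustrative names, and `rbind` is used here as a shorter form of the same interleaving trick.

```r
# Same string as in the question
df <- "ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD"
width <- 20                                   # assumed window length

# Every valid start position: 1, 2, ..., nchar(df) - width + 1
starts <- seq_len(nchar(df) - width + 1)

# substring() is vectorised over its start and stop arguments
cuts <- substring(df, starts, starts + width - 1)
nams <- paste0("NAME", seq_along(cuts))

# rbind() puts names in row 1 and cuts in row 2; c() reads the matrix
# column by column, giving NAME1, cut1, NAME2, cut2, ...
names.and.cuts <- c(rbind(nams, cuts))

# Hypothetical output location; replace with a real file name as needed
out <- tempfile(fileext = ".txt")
writeLines(names.and.cuts, out)
head(readLines(out), 4)
```

`writeLines` writes one element per line, so no `quote`, `row.names`, or `col.names` arguments are needed.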
Answer 1 (score: 4)
Creating the text file of substrings:
n <- 20 # length of substrings
starts <- seq(nchar(df) - n + 1) # every valid start position
v1 <- mapply(substr, starts, starts + n - 1, MoreArgs = list(x = df))
# The trailing "\n" puts each name on its own line above its substring
names(v1) <- paste0("NAME", seq_along(v1), "\n")
write.table(v1, file = "filename.txt", quote = FALSE, sep = "",
            col.names = FALSE)
Randomly replacing one or two letters (A-D):
myfun <- function() {
  # Pick one or two positions, then draw a replacement letter for each.
  # A draw may equal the original letter, so a "change" can be a no-op.
  idx <- sample(seq(n), sample(1:2, 1))
  rep <- sample(LETTERS[1:4], length(idx), replace = TRUE)
  return(list(idx = idx, rep = rep))
}
new <- replicate(length(v1), myfun(), simplify = FALSE)
v2 <- mapply(function(x, y, z) paste(replace(x, y, z), collapse = ""),
             strsplit(v1, ""),
             lapply(new, "[[", "idx"),
             lapply(new, "[[", "rep"))
# names(v1) already end in "\n", so build the new names from scratch
names(v2) <- paste0("NAME", seq_along(v2), ".1")
write.table(v2, file = "filename2.txt", quote = FALSE, sep = "\n",
            col.names = FALSE)
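Because the replacement letters above are drawn from all four letters, a drawn letter can coincide with the one already in place. If every selected position must really change, one possible variant is sketched below. `mutate_string` and its `alphabet` argument are hypothetical names, not part of the answer above.

```r
# Change one or two positions of s, always to a different letter.
# `alphabet` is the assumed set of allowed letters.
mutate_string <- function(s, alphabet = c("A", "B", "C", "D")) {
  chars <- strsplit(s, "")[[1]]
  k <- sample(1:2, 1)                     # how many positions to change
  idx <- sample(seq_along(chars), k)      # which positions to change
  for (i in idx) {
    others <- setdiff(alphabet, chars[i]) # exclude the current letter
    chars[i] <- others[sample.int(length(others), 1)]
  }
  paste(chars, collapse = "")
}

set.seed(42)
original <- "ABCBDBDBCBABABDBCBCB"
mutated <- mutate_string(original)
mutated
```

Indexing with `sample.int(length(others), 1)` rather than calling `sample(others, 1)` is a defensive choice: it behaves the same whatever the length or type of `others`.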
Answer 2 (score: 2)
One way, albeit a slow one:
Rgames> foo<-paste(sample(c('a','b','c','d'),20,rep=T),sep='',collapse='')
Rgames> bar<-matrix(unlist(strsplit(foo,'')),ncol=5)
Rgames> bar
     [,1] [,2] [,3] [,4] [,5]
[1,] "c"  "c"  "a"  "c"  "a"
[2,] "c"  "c"  "b"  "a"  "b"
[3,] "b"  "b"  "a"  "c"  "d"
[4,] "c"  "b"  "a"  "c"  "c"
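Now you can pick random indices and replace the selected positions using sample(c('a','b','c','d'),1). For "true" randomness, I wouldn't even force a change: if the newly drawn letter happens to be the same as the original, so be it.
Like this: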
ibar<-sample(1:5,4,rep=T) # one random column number for each row
for ( j in 1: 4) bar[j,ibar[j]]<-sample(c('a','b','c','d'),1)
Then, if necessary, use paste to put the letters back together.
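Putting those pieces together, a small end-to-end sketch might look like this. The input vector `strings` and the name `result` are made up for illustration, and each row of the matrix holds one string, which is an assumption about the intended layout.

```r
set.seed(1)

# Hypothetical input: four strings of five letters each
strings <- c("ccaca", "ccbab", "bbacd", "cbacc")

# One row per string, one column per character
bar <- do.call(rbind, strsplit(strings, ""))

# One random column number for each row
ibar <- sample(ncol(bar), nrow(bar), replace = TRUE)

# Redraw the letter at the chosen position; it may equal the old letter
for (j in seq_len(nrow(bar))) {
  bar[j, ibar[j]] <- sample(c("a", "b", "c", "d"), 1)
}

# Paste each row back into a single string
result <- apply(bar, 1, paste, collapse = "")
result
```

Since the redraw is unconstrained, each output string differs from its input in at most one position, and sometimes in none.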
Answer 3 (score: 2)
First part of the question:
df <- c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")
nstrchars <- 20
count <- nchar(df)-nstrchars+1 # number of length-20 windows
length20substrings <- data.frame(length20substrings=sapply(1:count,function(x)substr(df,x,x+nstrchars-1)))
# to save to a text file. I chose not to include row names or a column name in the .txt file
write.table(length20substrings,"length20substrings.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)
Second part:
# create a function that will randomly pick one or two spots in a string and replace
# those spots with one of the other characters present in the string:
changefxn <- function(x){
  x <- as.character(x)
  nc <- nchar(x)
  id <- seq(1,nc)
  numchanges <- sample(1:2,1)
  ids <- sample(id,numchanges)     # positions to change
  splitstr <- unlist(strsplit(x,""))
  chars2repl <- splitstr[ids]      # letters currently at those positions
  charspresent <- unique(splitstr)
  # replace each sampled position with a different character from the string
  for (k in seq_along(ids)) {
    splitstr[ids[k]] <- sample(setdiff(charspresent,chars2repl[k]),1)
  }
  newstr <- paste(splitstr,collapse="")
  return(newstr)
}
# try it out
changefxn("asbbad")
changefxn("12lkjaf38gs")
# apply changefxn to all the substrings from part 1
length20substrings <- as.character(length20substrings[,1])
newstrings <- lapply(length20substrings, changefxn)
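To sanity-check results like these, it can help to count the positions at which each new string differs from its source. The helper below is only a sketch: `hamming` is a hypothetical name, and the vectors `old` and `new` are small hand-made examples (the second pair is the NAME1 / NAME1.1 example from the question).

```r
# Number of differing positions between paired strings of equal length
hamming <- function(a, b) {
  stopifnot(length(a) == length(b), all(nchar(a) == nchar(b)))
  mapply(function(x, y) sum(x != y),
         strsplit(a, ""), strsplit(b, ""),
         USE.NAMES = FALSE)
}

old <- c("asbbad", "ABCBDBDBCBABABDBCBCB")
new <- c("asbbcd", "ABBBDBDBCBDBABDBCBCB")
hamming(old, new)
# [1] 1 2
```

With the objects from the code above, something like `hamming(length20substrings, unlist(newstrings))` should then return only 1s and 2s.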