Remove characters at given positions from a string in R?

Time: 2012-08-21 00:42:52

Tags: string r character

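I am looking for a way to remove the characters at certain positions of a string in R. For example, given the string "1,2,1,1,2,1,1,1,1,2,1,1", I want to remove the characters at the third, fourth, seventh and eighth positions. The operation would yield the string: "1,1,2,1,1,1,1,2,1,1"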
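
To pin down the indexing in this example: the positions count characters, commas included, so positions 3, 4, 7 and 8 are "2", ",", "1" and ",". A minimal sketch of the intended operation, assuming the string is plain single-byte ASCII (as comma-separated digits are). `remove_at` is a hypothetical helper name, not something from the original post, and it goes through a raw byte vector rather than strsplit:

```r
# Hypothetical helper: drop the characters at positions `pos` (1-based).
# Assumes single-byte (ASCII) text, so one byte equals one character.
remove_at <- function(s, pos) {
  if (length(pos) == 0L) return(s)  # x[-integer(0)] would drop everything
  rawToChar(charToRaw(s)[-pos])
}

s <- "1,2,1,1,2,1,1,1,1,2,1,1"
remove_at(s, c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"
```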

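Unfortunately, breaking the string into a list with strsplit is not an option, because the strings I am working with are more than 1 million characters long. Given that I have about 2,500 of them, that could take a while.

Alternatively, finding a way to replace a character with the empty string "" would achieve the same thing, I think. Following that line of thought, I came across this StackOverflow post: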

R: How can I replace let's say the 5th element within a string?

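Unfortunately, the suggested solution is hard to generalize efficiently, and with a list of 2000 positions to remove it takes roughly 60 seconds per input string: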

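# Remove the characters at the (ascending) positions in pos by pasting
# together the substrings that lie between them.
subchar2 = function(inputstring, pos){
    string = ""
    memory = 0  # last removed position
    for(num in pos){
        string = paste(string, substr(inputstring, (memory+1), (num-1)), sep = "")
        memory = num
    }
    string = paste(string, substr(inputstring, (memory+1), nchar(inputstring)), sep = "")
    return(string)
}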

After looking into the problem, I found a piece of code that appears to replace the characters at certain positions with "-":

subchar <- function(string, pos) {
        for(i in pos) {
            string <- gsub(paste("^(.{", i-1, "}).", sep=""), "\\1-", string)
        }
        return(string)
}

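I don't understand regular expressions very well (yet), but I strongly suspect that something along these lines could be much faster than the first solution. Unfortunately, this subchar function seems to break when the values in pos get high: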

> test = subchar(data[1], 257)
Error in gsub(paste("^(.{", i - 1, "}).", sep = ""), "\\1-", string) :
invalid regular expression '^(.{256}).', reason 'Invalid contents of {}'
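
The error message points at the repetition count inside {}: the pattern for position 257 needs .{256}, which the default regex engine rejects here. One possible workaround is sketched below, under the assumption that the PCRE engine selected by perl = TRUE accepts larger counts. As far as I know PCRE has its own cap of 65535, so this would still not reach most positions in a million-character string. `subchar_del` is a hypothetical variant that deletes the character instead of writing "-", and it walks the positions from highest to lowest so that earlier deletions do not shift the later ones:

```r
# Hypothetical variant of subchar: deletes the character at each position.
# Assumption: perl = TRUE (PCRE) accepts repetition counts above 255.
subchar_del <- function(string, pos) {
  # Highest position first, so the remaining positions stay valid.
  for (i in sort(unique(pos), decreasing = TRUE)) {
    pattern <- sprintf("^(.{%d}).", as.integer(i) - 1L)
    string <- sub(pattern, "\\1", string, perl = TRUE)
  }
  string
}

subchar_del("1,2,1,1,2,1,1,1,1,2,1,1", c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"

long <- paste(rep("a", 300), collapse = "")
nchar(subchar_del(long, 257))
# [1] 299
```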

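I was also considering using SQL to read the string data into a table, but I was hoping for an elegant string solution. The SQL implementations in R look rather complicated.

Any ideas? Thanks!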

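3 Answers:

Answer 0 (score: 3)

Read them in with scan(). You can set the separator to "," and what="a". You can scan one "line" at a time with nlines=1, and if the input is a textConnection, the "pipe" will "remember" where it was at the last read.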

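# Note: with a single vector argument, sep has no effect, so x is a
# character vector of 1000 one-item elements (one per "line").
x <- paste( sample(0:1, 1000, replace=TRUE), sep=",")
xin <- textConnection(x)

x995 <- scan(xin, sep=",", what="a", nmax=995)
# Read 995 items
x5 <- scan(xin, sep=",", what="a", nmax=995)
# Read 5 items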

Here is an illustration with 5 "lines":

> x <- paste( rep( paste(sample(0:1, 50, rep=T), collapse=","),  5),  collapse="\n")
> str(x)
 chr "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0\n1,0,0,0,0,1,0,0,1,1,1,0,1,"| __truncated__
> xin <- textConnection(x)
> x1 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x2 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x3 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x4 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x5 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x6 <- scan(xin, sep=",", what="a", nlines=1)
Read 0 items
> length(x1)
[1] 50
> length(x1[-c(3,4,7,8)])
[1] 46
> paste(x1, collapse=",")
[1] "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0"
> 
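
Putting the pieces together for a single string: read the fields, drop some by index, and paste the rest back. This is only a sketch. It uses the text argument of scan() instead of a textConnection, and `drop_fields` is a hypothetical name. Note that the indices here count comma-separated fields, not characters, so dropping fields 3, 4, 7 and 8 gives a different result than the character-position example in the question:

```r
# Hypothetical helper: remove comma-separated fields by index.
drop_fields <- function(s, idx) {
  fields <- scan(text = s, sep = ",", what = "a", quiet = TRUE)
  if (length(idx) > 0L) fields <- fields[-idx]
  paste(fields, collapse = ",")
}

drop_fields("1,2,1,1,2,1,1,1,1,2,1,1", c(3, 4, 7, 8))
# [1] "1,2,2,1,1,2,1,1"
```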

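Answer 1 (score: 3)

If you use strsplit, it is more than ten times faster with fixed = TRUE. Extrapolating roughly, processing 2,500 strings of 1,000,000 comma-separated integers each would take a little more than 2 minutes.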

N <- 1000000
x <- sample(0:1, N, replace = TRUE)
s <- paste(x, collapse = ",")

# this is a vector of 10 strings
M <- 10
S <- rep(s, M)

system.time(y <- strsplit(S, split = ","))
# user  system elapsed 
# 6.57    0.00    6.56 
system.time(y <- strsplit(S, split = ",", fixed = TRUE))
# user  system elapsed 
# 0.46    0.03    0.50

This is almost 3 times faster than using scan:

system.time(scan(textConnection(S), sep=",", what="a"))
# Read 10000000 items
# user  system elapsed 
# 1.21    0.09    1.42
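
For completeness, a sketch of how the split result could be turned back into strings once the unwanted fields are dropped. `idx` and the toy input are made up for illustration, and the indices count fields, not characters:

```r
S <- c("1,2,1,1,2,1", "0,1,0,1,0,1")  # toy stand-in for the long strings
idx <- c(3, 4)                          # hypothetical fields to drop

parts <- strsplit(S, split = ",", fixed = TRUE)
out <- vapply(parts,
              function(f) paste(f[-idx], collapse = ","),
              character(1))
out
# [1] "1,2,2,1" "0,1,0,1"
```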

Answer 2 (score: 2)

A quick fix is to remove the paste from inside the for loop:

# Same as subchar2, but collects the pieces in a vector and pastes once at the end.
subchar3<-function(inputstring, pos){
    string = ""
    memory = 0  # last removed position
    for(num in pos){
        string = c(string, substr(inputstring, (memory+1), (num-1)))
        memory = num
    }
    string = paste(c(string, substr(inputstring, (memory+1), nchar(inputstring))), collapse = "")
    return(string)
}
data<-paste(sample(letters,100000,replace=T),collapse='')
remove<-sample(1:nchar(data),200)
remove<-remove[order(remove)]
s2<-subchar2(data,remove)
s3<-subchar3(data,remove)
identical(s2,s3)
#[1] TRUE

> library(rbenchmark)
> benchmark(subchar2(data,remove),subchar3(data,remove),replications=10)
                    test replications elapsed relative user.self sys.self
1 subchar2(data, remove)           10   43.64 40.78505     39.97      1.9
2 subchar3(data, remove)           10    1.07  1.00000      1.01      0.0
  user.child sys.child
1         NA        NA
2         NA        NA
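
The loop can also be dropped altogether, since substring() is vectorized over its start and end arguments. A sketch along the same lines as subchar3. `subchar4` is a hypothetical name, and I have not benchmarked it against the functions above:

```r
subchar4 <- function(inputstring, pos) {
  pos <- sort(unique(pos))
  starts <- c(0, pos) + 1                 # first character of each kept piece
  ends <- c(pos - 1, nchar(inputstring))  # last character of each kept piece
  # Pieces with start > end come back as "" and vanish in the paste.
  paste(substring(inputstring, starts, ends), collapse = "")
}

subchar4("1,2,1,1,2,1,1,1,1,2,1,1", c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"
```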