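Finding unique sets of strings in a vector, where vector elements can be multiple strings

Time: 2016-02-05 18:52:45

Tags: r

I have a series of batch records that are labeled sequentially. Sometimes the batches overlap.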

x <- c("1","1","1/2","2","3","4","5/4","5")
> data.frame(x)
    x
1   1
2   1
3 1/2
4   2
5   3
6   4
7 5/4
8   5

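I want to find the sets of batches that do not overlap and label those periods. Batch "1/2" includes both "1" and "2", so it is not unique. When batch = "3", it is not contained in any previous batch, so it starts a new period. I'm having difficulty dealing with the combined batches; otherwise this would be straightforward. The result would be: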

    x period
1   1      1
2   1      1
3 1/2      1
4   2      1
5   3      2
6   4      3
7 5/4      3
8   5      3

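My experience is with more functional programming paradigms, so I know the way I did this is very un-R. I'm looking for a clean, simple way to do this in R. Any help is appreciated.

Here is my un-R code. It works on this example, but it is very clunky and does not scale.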

x <- c("1","1","1/2","2","3","4","5/4","5")

p <- 1 #period number
temp <- NULL #temp variable for storing cases of x (batches)
temp[1] <- x[1]
period <- NULL
rl <- 0 #length to repeat period

for (i in 1:length(x)){

    #check for "/", split and add to temp
    if (grepl("/", x[i])){
        z <- strsplit(x[i], "/") #split character
        z <- unlist(z) #convert to vector
        temp <- c(temp, z, x[i]) #add to temp vector for comparison
    }

    #check if x in temp
    if(x[i] %in% temp){
        temp <- append(temp, x[i]) #add to search vector
        rl <- rl + 1 #increase length
    } else {
        period <- append(period, rep(p, rl)) #add to period vector
        p <- p + 1 #increase period count
        temp <- x[i] #reset, keeping the current batch so an immediate repeat stays in this period
        rl <- 1 #reset
    }
}

#add last batch

rl <- length(x) - length(period)
period <- append(period, rep(p,rl))

df <- data.frame(x,period)

> df
    x period
1   1      1
2   1      1
3 1/2      1
4   2      1
5   3      2
6   4      3
7 5/4      3
8   5      3

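3 Answers:

Answer 0 (score: 2)

R has functional-paradigm influences, so you can solve this with Map and Reduce. Note that this solution follows your approach of unioning the values seen so far. A simpler approach is possible if you assume the batch numbers are consecutive, as they are in your example.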

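x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")
# cumulative union of the batch numbers seen so far (the first element is the empty start)
r <- Reduce(union, s, init = character(0), accumulate = TRUE)
# TRUE where a record shares nothing with what came before it
p <- cumsum(mapply(function(x, y) length(intersect(x, y)) == 0, s, r[-length(r)]))

data.frame(x, period = p)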
    x period
1   1      1
2   1      1
3 1/2      1
4   2      1
5   3      2
6   4      3
7 5/4      3
8   5      3

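This works by first computing the cumulative union of the values seen. It then maps over that to determine where none of the current values has been seen before. (Alternatively, this second step could be folded into the reduce, but that would be more verbose without support for destructuring.) The cumulative sum then gives the "period" number, based on how many times the intersection has come up empty.

If you do assume that the batch numbers are consecutive, then you can do the following: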
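For illustration, the single-fold alternative mentioned in the parenthetical might look roughly like the sketch below. The `step` helper and the `seen`/`period` field names are hypothetical, introduced only for this example; the state carried through `Reduce` is a two-element list because R cannot destructure a pair directly.

```r
x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")

# Hypothetical helper: carries the batches seen so far and the current period.
step <- function(state, batch) {
  is_new <- length(intersect(batch, state$seen)) == 0
  list(seen = union(state$seen, batch),
       period = state$period + is_new)
}

states <- Reduce(step, s,
                 init = list(seen = character(0), period = 0),
                 accumulate = TRUE, simplify = FALSE)

# Drop the initial state and pull the period out of each step.
p <- vapply(states[-1], function(st) st$period, numeric(1))

data.frame(x, period = p)
```

This trades the separate Map step for a bulkier state object, which is the verbosity the answer alludes to.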

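x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")
# 2 x length(x) matrix: row 1 is the lowest batch number in each record, row 2 the highest
n <- sapply(s, function(x) range(as.numeric(x)))
# a new period starts when a record's lowest number exceeds the previous record's highest
p <- cumsum(c(1, n[1, -1] > n[2, -ncol(n)]))

data.frame(x, period = p)

This gives the same result (not repeated here).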
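To make the comparison in the consecutive-number approach easier to follow, here is a sketch that spells out the intermediate vectors. The names `lo`, `hi`, and `starts_new` are introduced only for this illustration, and the approach still relies on the assumption that batch numbers never go backwards.

```r
x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")

# Lowest and highest batch number named in each record.
lo <- vapply(s, function(b) min(as.numeric(b)), numeric(1))
hi <- vapply(s, function(b) max(as.numeric(b)), numeric(1))

# A record opens a new period when its lowest batch number is above
# the highest batch number of the record just before it.
starts_new <- c(TRUE, lo[-1] > hi[-length(hi)])
p <- cumsum(starts_new)

data.frame(x, lo, hi, starts_new, period = p)
```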

Answer 1 (score: 1)

Slightly shorter:

x <- c("1","1","1/2","2","3","4","5/4","5")
x <- data.frame(x = x, period = -1, stringsAsFactors = FALSE)
period <- 0
prevBatch <- -1
for (i in seq_len(nrow(x)))
{
   # compare batch numbers as numbers, not as strings ("10" sorts before "9")
   spl <- as.numeric(unlist(strsplit(x$x[i], "/")))
   currentBatch <- min(spl)
   if (currentBatch < prevBatch) { stop("Error in sequence") }
   if (currentBatch > prevBatch)
      period <- period + 1

   x$period[i] <- period

   prevBatch <- max(spl)
}
x

Answer 2 (score: 0)

Here is an approach that uses tidyr to split the data into two columns, which makes it easier to work with:

# sample data
x <- c("1","1","1/2","2","3","4","5/4","5")
df <- data.frame(x)

library(tidyr)
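# separate x into two columns, with second NA if only one number
# (fill = 'right' pads the missing second piece with NA without a warning)
df <- separate(df, x, c('x1', 'x2'), sep = '/', remove = FALSE, convert = TRUE, fill = 'right')

Now df looks like: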

> df
    x x1 x2
1   1  1 NA
2   1  1 NA
3 1/2  1  2
4   2  2 NA
5   3  3 NA
6   4  4 NA
7 5/4  5  4
8   5  5 NA

Now the loop can be much simpler:

period <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  # batch numbers in row i (dropping the NA filler) and in every row above it
  current <- na.omit(unlist(df[i, c('x1', 'x2')]))
  seen <- unlist(df[seq_len(i - 1), c('x1', 'x2')])
  previous <- if (i > 1) period[i - 1] else 0L
  # if row i shares a number with an earlier row, repeat the period; else start the next one
  period[i] <- if (any(current %in% seen)) previous else previous + 1L
}

# rebuild df with x and period
df <- data.frame(x = df$x, period = period)

The resulting df:

> df
    x period
1   1      1
2   1      1
3 1/2      1
4   2      1
5   3      2
6   4      3
7 5/4      3
8   5      3