I have a series of batch records that are labeled in sequence. Sometimes the batches overlap.
x <- c("1","1","1/2","2","3","4","5/4","5")
> data.frame(x)
x
1 1
2 1
3 1/2
4 2
5 3
6 4
7 5/4
8 5
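I want to find the sets of batches that do not overlap and label them as periods. Batch "1/2" includes both "1" and "2", so it is not unique. When the batch is "3", it is not contained in any previous batch, so it starts a new period. I am having trouble handling the combined batches; otherwise this would be straightforward. The result would be: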
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
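My experience is mostly with functional programming paradigms, so I know the way I am doing this is very un-R. I am looking for a clean, simple way to do it in R. Any help is appreciated.
Here is my un-R code. It works on this example, but it is clunky and does not scale.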
x <- c("1","1","1/2","2","3","4","5/4","5")
p <- 1 #period number
temp <- NULL #temp variable for storing cases of x (batches)
temp[1] <- x[1]
period <- NULL
rl <- 0 #length to repeat period
for (i in 1:length(x)){
  #check for "/", split and add to temp
  if (grepl("/", x[i])){
    z <- strsplit(x[i], "/") #split character
    z <- unlist(z) #convert to vector
    temp <- c(temp, z, x[i]) #add to temp vector for comparison
  }
  #check if x in temp
  if(x[i] %in% temp){
    temp <- append(temp, x[i]) #add to search vector
    rl <- rl + 1 #increase length
  } else {
    period <- append(period, rep(p, rl)) #add to period vector
    p <- p + 1 #increase period count
    temp <- NULL #reset
    rl <- 1 #reset
  }
}
#add last batch
rl <- length(x) - length(period)
period <- append(period, rep(p,rl))
df <- data.frame(x,period)
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
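Answer 0 (score: 2)
R has functional-paradigm influences, so you can use Map and Reduce to solve this problem. Note that this solution follows your approach of combining the values seen so far. A simpler approach is possible if you assume that batch numbers are consecutive, as they are in your example.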
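x <- c("1","1","1/2","2","3","4","5/4","5")
s <- strsplit(x, "/")  # split combined batches into their components
# r[[i]] holds every batch number seen before row i (r[[1]] is empty)
r <- Reduce(union, s, init = character(0), accumulate = TRUE)
# a new period starts wherever the current row shares nothing with what came before
p <- cumsum(mapply(function(x, y) length(intersect(x, y)) == 0, s, r[-length(r)]))
data.frame(x, period = p)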
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
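This first computes the cumulative union of the values seen so far. It then maps over that to find the rows where none of the current values has been seen before. (Alternatively, the second step could be folded into the reduce, but without destructuring support that would be more verbose.) The cumulative sum then gives the "period" number, counting how many times the intersection has come up empty.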
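The single-fold alternative mentioned in the parenthesis can be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper name `step` and the state fields `seen` and `period` are made up here, and the state is carried as a list because R has no destructuring.

```r
x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")

# Hypothetical fold step: the state carries both the values seen so far
# and the current period number.
step <- function(state, batch) {
  is_new <- length(intersect(batch, state$seen)) == 0
  list(seen = union(state$seen, batch),
       period = state$period + is_new)
}

states <- Reduce(step, s,
                 init = list(seen = character(0), period = 0),
                 accumulate = TRUE)

# Drop the initial state and pull out the period recorded after each row.
period <- vapply(states[-1], function(st) st$period, numeric(1))
data.frame(x, period)
```

Compared with the two-step version, this walks the data once but has to pack and unpack the state by hand, which is the extra verbosity referred to above.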
If you do assume that batch numbers are consecutive, then you can do the following:
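x <- c("1","1","1/2","2","3","4","5/4","5")
s <- strsplit(x, "/")
# 2 x length(x) matrix: row 1 is each row's smallest batch number, row 2 its largest
n <- sapply(s, function(x) range(as.numeric(x)))
# a new period starts when a row's minimum exceeds the previous row's maximum
p <- cumsum(c(1, n[1, -1] > n[2, -ncol(n)]))
data.frame(x, period = p)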
This gives the same result (not repeated here).
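To make the comparison easier to follow, here is a small sketch that lays the intermediate values out side by side. The column names `lo`, `hi`, and `starts_new` are made up for illustration; the logic is the same min-versus-previous-max test, and it still relies on the assumption that batch numbers are consecutive.

```r
x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")

# Smallest and largest batch number in each row.
n <- sapply(s, function(b) range(as.numeric(b)))
lo <- n[1, ]
hi <- n[2, ]

# The first row always starts a period; later rows start one only when
# their smallest batch number is above the previous row's largest.
starts_new <- c(TRUE, lo[-1] > hi[-length(hi)])

data.frame(x, lo, hi, starts_new, period = cumsum(starts_new))
```

Only rows 1, 5, and 6 have `starts_new` equal to TRUE, which is where the period number steps up.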
Answer 1 (score: 1)
Slightly shorter:
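x <- c("1","1","1/2","2","3","4","5/4","5")
x <- data.frame(x = x, period = -1, stringsAsFactors = FALSE)
period <- 0
prevBatch <- -1
for (i in seq_len(nrow(x))) {
  # convert to numbers so min/max and </> compare numerically, not as strings
  spl <- as.numeric(unlist(strsplit(x$x[i], "/")))
  currentBatch <- min(spl)
  if (currentBatch < prevBatch) stop("Error in sequence")
  # a batch beyond everything in the previous row starts a new period
  if (currentBatch > prevBatch) period <- period + 1
  x$period[i] <- period
  prevBatch <- max(spl)
}
x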
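One thing to watch with ordering comparisons on batch labels: the pieces returned by strsplit are character strings, and strings are compared lexicographically. The sample data only has single-digit batches, so the difference does not show up there. A minimal sketch with a made-up two-digit label ("9/10" is not from the original data):

```r
# Hypothetical combined batch with a two-digit component.
spl <- unlist(strsplit("9/10", "/"))

min_as_text   <- min(spl)              # string comparison: "10" sorts before "9"
min_as_number <- min(as.numeric(spl))  # numeric comparison

print(min_as_text)
print(min_as_number)
```

Converting with as.numeric before calling min and max avoids the problem, as long as every label is a plain number.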
Answer 2 (score: 0)
Here is an approach that uses tidyr to split the data into two columns, which makes it easier to work with:
# sample data
x <- c("1","1","1/2","2","3","4","5/4","5")
df <- data.frame(x)
library(tidyr)
# separate x into two columns, with second NA if only one number
df <- separate(df, x, c('x1', 'x2'), sep = '/', remove = FALSE, convert = TRUE)
Now df looks like:
> df
x x1 x2
1 1 1 NA
2 1 1 NA
3 1/2 1 2
4 2 2 NA
5 3 3 NA
6 4 4 NA
7 5/4 5 4
8 5 5 NA
Now the loop can be much simpler:
period <- 1  # the first row always belongs to period 1
for (i in seq_len(nrow(df))[-1]) {
  # batch numbers in row i and in all rows above it, ignoring the NA padding
  current  <- na.omit(unlist(df[i, 2:3]))
  previous <- na.omit(unlist(df[seq_len(i - 1), 2:3]))
  period <- c(period,
              if (any(current %in% previous)) period[i - 1]  # seen before: same period
              else period[i - 1] + 1)                        # otherwise start a new one
}
# rebuild df with x and period
df <- data.frame(x = df$x, period = period)
The resulting df:
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3