I am working with temporal data, and this question builds on code written earlier.
library(data.table)
Aggregated <- fread("
act1_1 act1_2 act1_3 act1_4 act1_5
2 1 3 2 6
1 2 2 1 1
1 4 2 2 3
")
cols <- names(Aggregated)
n <- length(cols)
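# every (row, window length, start column) combination that fits inside the row
vi <- CJ(rn = 1:nrow(Aggregated), len = 2:5, start = 1:n)[
  , end := start + len - 1L][
  end <= n]
# reshape to long format: one row per (rn, column position)
dl <- melt(setDT(Aggregated)[, rn := .I], id.vars = "rn", variable.name = "pos",
           variable.factor = TRUE)[
  , pos := as.integer(pos)][]
# non-equi join collects the values inside each window,
# then counts each (values, position) pair
result <- dl[vi, on = .(rn, pos >= start, pos <= end),
             .(rn, values = toString(value), position = toString(cols[x.pos])),
             by = .EACHI, nomatch = 0L][
  , .(freq = .N), by = .(values, position)]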
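fin <- result[order(nchar(values), values)]
# total frequency of each sequence across all positions
fin[, summed := sum(freq), by = values]
# keep only the first row of each distinct sequence
fin[, sm := ifelse(duplicated(values), NA, summed)]
fin <- fin[!is.na(sm)]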
My question is how to create additional columns that return where each frequency starts and ends:
   values       position freq summed  Start    End
5:    2,1 act1_1, act1_2    1      2 act1_1 act1_4
6:    2,1 act1_3, act1_4    1
7:    2,2 act1_2, act1_3    1      2
8:    2,2 act1_3, act1_4    1
Example of the start column (not taken from the aggregated df): the start point of a pair of numbers
    values       position freq summed  Start End
 5:   2, 1 act1_1, act1_2    1      1 act1_1
 6:   2, 2  act1_1 act1_4    1      1     NA
 7:   2, 3 act1_3, act1_4    1      1     NA
 8:   2, 4 act1_2, act1_3    1      1     NA
 9:   2, 7 act1_3, act1_4    1      1     NA
10:   3, 7 act1_5, act1_6    1      1 act1_5
11:   4, 1 act1_5, act1_6    1      2 act1_4
11:   4, 1 act1_7, act1_8    1     NA     NA
12:   4, 2 act1_4, act1_5    1      1     NA
Example of the end column: the end point of a pair of numbers
    values       position freq summed Start    End
 5:   2, 1 act1_1, act1_2    1      1       act1_4
 6:   2, 2  act1_1 act1_4    1      1           NA
 7:   2, 3 act1_3, act1_4    1      1           NA
 8:   2, 4 act1_2, act1_3    1      1           NA
 9:   2, 7 act1_3, act1_4    1      1           NA
10:   3, 7 act1_5, act1_6    1      1       act1_6
11:   4, 1 act1_5, act1_6    1      2       act1_8
11:   4, 1 act1_7, act1_8    1     NA           NA
12:   4, 2 act1_4, act1_5    1      1           NA
Final output:
    values       position freq summed  Start    End
 5:   2, 1 act1_1, act1_2    1      1 act1_1 act1_4
 6:   2, 2  act1_1 act1_4    1      1     NA     NA
 7:   2, 3 act1_3, act1_4    1      1     NA     NA
 8:   2, 4 act1_2, act1_3    1      1     NA     NA
 9:   2, 7 act1_3, act1_4    1      1     NA     NA
10:   3, 7 act1_5, act1_6    1      1 act1_5 act1_6
11:   4, 1 act1_5, act1_6    1      2 act1_4 act1_8
11:   4, 1 act1_7, act1_8    1     NA     NA     NA
12:   4, 2 act1_4, act1_5    1      1     NA     NA
Answer 0 (score: 1)
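It is still not entirely clear what you are after, but perhaps this could be a starting point, using base R and some dplyr (with tidyr for separate_rows()):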
# first we need a new column in the dataset, used later for the join
fin$value_short <- substr(fin$values, 1, 1)
library(dplyr)
library(tidyr)  # separate_rows() comes from tidyr
# now the dplyr chain
aggregated <- fin %>%
  # add a column to use subsequently
  mutate(value_short = substr(values, 1, 1)) %>%
  # split the position column into one row per position
  separate_rows(position, sep = ', ') %>%
  # select useful columns
  select(position, value_short) %>%
  # group by the first value of each sequence
  group_by(value_short) %>%
  # calculate the start and the end
  # (substr(position, 6, 6) assumes a single-digit suffix, as in act1_1 ... act1_5)
  summarise(start = paste0('act1_', min(as.numeric(substr(position, 6, 6)))),
            end   = paste0('act1_', max(as.numeric(substr(position, 6, 6)))))
The result is:
aggregated
# A tibble: 4 x 3
  value_short start  end
  <chr>       <chr>  <chr>
1 1           act1_1 act1_5
2 2           act1_1 act1_5
3 3           act1_3 act1_5
4 4           act1_2 act1_5
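As a side note, the same grouping step can be sketched with data.table alone. This is only a minimal sketch under stated assumptions: `fin_demo` is a hypothetical stand-in for `fin` with made-up rows, and the names `long`, `idx` and `span` are introduced here for illustration. It parses the numeric suffix with `sub()` rather than `substr()`, so it would also cope with column names such as `act1_12`.

```r
library(data.table)

# Hypothetical stand-in for `fin`; only the columns used here
fin_demo <- data.table(
  values   = c("2, 1", "2, 2", "3, 2", "2, 1, 3"),
  position = c("act1_1, act1_2", "act1_2, act1_3", "act1_3, act1_4",
               "act1_1, act1_2, act1_3")
)

# First value of each sequence: the text before the first comma
fin_demo[, value_short := sub(",.*$", "", values)]

# One row per individual position, keeping the group key
long <- fin_demo[, .(pos = unlist(strsplit(position, ", ", fixed = TRUE))),
                 by = value_short]

# Numeric suffix of the column name (works for more than one digit)
long[, idx := as.integer(sub("^act1_", "", pos))]

# Earliest and latest position per leading value
span <- long[, .(start = paste0("act1_", min(idx)),
                 end   = paste0("act1_", max(idx))),
             by = value_short]
print(span)
```

With the toy rows above, the group for leading value 2 spans act1_1 to act1_3 and the group for leading value 3 spans act1_3 to act1_4.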
# Now, let's join the aggregated data to the original data:
fin_aggr <- fin %>% left_join(aggregated, by = "value_short")
# keep start/end only on the first row of each value_short group
fin_aggr$start <- ifelse(duplicated(fin_aggr$value_short), NA, fin_aggr$start)
fin_aggr$end <- ifelse(duplicated(fin_aggr$value_short), NA, fin_aggr$end)
# remove the helper column (dropped by name rather than by position)
fin_aggr$value_short <- NULL
The result is:
fin_aggr
          values                               position freq summed sm  start    end
 1          1, 1                         act1_4, act1_5    1      1  1 act1_1 act1_5
 2          1, 2                         act1_1, act1_2    1      1  1   <NA>   <NA>
 3          1, 3                         act1_2, act1_3    1      1  1   <NA>   <NA>
 4          1, 4                         act1_1, act1_2    1      1  1   <NA>   <NA>
 5          2, 1                         act1_1, act1_2    1      2  2 act1_1 act1_5
 6          2, 2                         act1_2, act1_3    1      2  2   <NA>   <NA>
 7          2, 3                         act1_4, act1_5    1      1  1   <NA>   <NA>
 8          2, 6                         act1_4, act1_5    1      1  1   <NA>   <NA>
 9          3, 2                         act1_3, act1_4    1      1  1 act1_3 act1_5
10          4, 2                         act1_2, act1_3    1      1  1 act1_2 act1_5
11       1, 2, 2                 act1_1, act1_2, act1_3    1      1  1   <NA>   <NA>
12       1, 3, 2                 act1_2, act1_3, act1_4    1      1  1   <NA>   <NA>
13       1, 4, 2                 act1_1, act1_2, act1_3    1      1  1   <NA>   <NA>
14       2, 1, 1                 act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
15       2, 1, 3                 act1_1, act1_2, act1_3    1      1  1   <NA>   <NA>
16       2, 2, 1                 act1_2, act1_3, act1_4    1      1  1   <NA>   <NA>
17       2, 2, 3                 act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
18       3, 2, 6                 act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
19       4, 2, 2                 act1_2, act1_3, act1_4    1      1  1   <NA>   <NA>
20    1, 2, 2, 1         act1_1, act1_2, act1_3, act1_4    1      1  1   <NA>   <NA>
21    1, 3, 2, 6         act1_2, act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
22    1, 4, 2, 2         act1_1, act1_2, act1_3, act1_4    1      1  1   <NA>   <NA>
23    2, 1, 3, 2         act1_1, act1_2, act1_3, act1_4    1      1  1   <NA>   <NA>
24    2, 2, 1, 1         act1_2, act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
25    4, 2, 2, 3         act1_2, act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
26 1, 2, 2, 1, 1 act1_1, act1_2, act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
27 1, 4, 2, 2, 3 act1_1, act1_2, act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
28 2, 1, 3, 2, 6 act1_1, act1_2, act1_3, act1_4, act1_5    1      1  1   <NA>   <NA>
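Since the question is ambiguous, another possible reading is that Start and End should be computed per full `values` sequence rather than per leading value: in the question's first example, the sequence 2,1 occurs at act1_1, act1_2 and at act1_3, act1_4, and its Start/End are act1_1/act1_4. The following is a minimal data.table sketch of that interpretation, not a confirmed answer to the question. `res_demo` is a hypothetical stand-in for `result` (before the duplicate rows are dropped), filled with the four rows from that example; `idx_list`, `first_idx` and `last_idx` are helper names introduced here.

```r
library(data.table)

# Hypothetical stand-in for `result`: one row per (values, position) pair
res_demo <- data.table(
  values   = c("2, 1", "2, 1", "2, 2", "2, 2"),
  position = c("act1_1, act1_2", "act1_3, act1_4",
               "act1_2, act1_3", "act1_3, act1_4"),
  freq     = 1L
)

# Position indices covered by each row (numeric suffix of each column name)
idx_list <- lapply(strsplit(res_demo$position, ", ", fixed = TRUE),
                   function(p) as.integer(sub("^act1_", "", p)))
res_demo[, first_idx := vapply(idx_list, min, integer(1))]
res_demo[, last_idx  := vapply(idx_list, max, integer(1))]

# Per sequence: total frequency, earliest start and latest end
res_demo[, `:=`(summed = sum(freq),
                Start  = paste0("act1_", min(first_idx)),
                End    = paste0("act1_", max(last_idx))),
         by = values]

# Show the summary only on the first row of each sequence, NA elsewhere
res_demo[duplicated(values), summed := NA_integer_]
res_demo[duplicated(values), c("Start", "End") := NA_character_]

# Drop the helper columns
res_demo[, c("first_idx", "last_idx") := NULL]
print(res_demo)
```

Unlike the example tables in the question, this sketch keeps the repeated rows and blanks their summary columns instead of dropping them; filtering with `!is.na(summed)` afterwards would leave one row per sequence.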