I have a data table like this:
id  group  startPoints       endPoints
1   A      4, 20, 50, 63     8, 25, 60, 78
1   A      120, 300          231, 332
1   B      500               550
1   B      650, 800          700, 820
1   C      830, 900, 950     850, 920, 970
What I want is the SUM / MEAN / etc. of the lengths (EndPoint - StartPoint) within each group, but I cannot get it to work with sapply.
My goal is a result of this form:
Group SUM
A 177
B 120
C 60
I am trying to combine two things:
lengths <- strsplit(as.character(table$endPoints), ",", fixed=TRUE)
and
y <- factor(table$group)
tapply(lengths, y, sum)
But I am stuck and cannot get it to work.
Adding sample data:
structure(list(id = c(1L, 1L, 1L, 1L, 1L), group = structure(c(1L,
1L, 2L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
startPoints = structure(c(2L, 1L, 3L, 4L, 5L), .Label = c("120,300,",
"4,20,50,63,", "500,", "650,800,", "830,900,950,"), class = "factor"),
endPoints = structure(c(4L, 1L, 2L, 3L, 5L), .Label = c("231,332,",
"550,", "700,820,", "8,25,60,78", "850,920,970,"), class = "factor")),
.Names = c("id", "group", "startPoints", "endPoints"), class = "data.frame",
row.names = c(NA, -5L))
Answer 0 (score: 3)
This has nothing at all to do with sapply, which you asked about, but here is one approach using concat.split.multiple from my "splitstackshape" package.
First, split the data into a semi-long format:
library(splitstackshape)
mydf2 <- concat.split.multiple(mydf, split.cols = c("startPoints", "endPoints"),
                               seps = ",", direction = "long")
Calculate the difference between "endPoints" and "startPoints":
mydf2$diffs <- mydf2$endPoints - mydf2$startPoints
head(mydf2)
#   id group .id time startPoints endPoints diffs
# 1  1     A   1    1           4         8     4
# 2  1     A   2    1         120       231   111
# 3  1     B   1    1         500       550    50
# 4  1     B   2    1         650       700    50
# 5  1     C   1    1         830       850    20
# 6  1     A   1    2          20        25     5
Then use aggregate (or data.table, or tapply, or your favorite aggregation function) to calculate whatever you want.
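The aggregation call itself is not shown in this answer. As a minimal sketch of that last step in base R, assuming a long-format data frame with group and diffs columns: long_df is a hypothetical hand-built stand-in using the values from the question's table, not the actual output of splitstackshape.

```r
# Hypothetical stand-in for the long-format result: one row per start/end pair.
# The values are copied from the table in the question.
long_df <- data.frame(
  group       = rep(c("A", "B", "C"), times = c(6, 3, 3)),
  startPoints = c(4, 20, 50, 63, 120, 300, 500, 650, 800, 830, 900, 950),
  endPoints   = c(8, 25, 60, 78, 231, 332, 550, 700, 820, 850, 920, 970)
)
long_df$diffs <- long_df$endPoints - long_df$startPoints

# Sum of interval lengths per group, via aggregate()
group_sums <- aggregate(diffs ~ group, data = long_df, FUN = sum)

# The same grouping with tapply(), here computing the mean instead
group_means <- tapply(long_df$diffs, long_df$group, mean)

print(group_sums)
print(group_means)
```

On this stand-in data the per-group sums come out as 177, 120 and 60, which matches the result requested in the question.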
Answer 1 (score: 1)
Or, more "manually": if your data frame is xx, split endPoints and startPoints into separate elements and find how many elements each row has:
endPoints = strsplit(as.character(xx$endPoints), ",", fixed=TRUE)
startPoints = strsplit(as.character(xx$startPoints), ",", fixed=TRUE)
len = sapply(endPoints, length)
Use those lengths to expand the original data frame, unlisting the previously packed elements:
yy = cbind(xx[rep(seq_len(nrow(xx)), len), c("id", "group")],
startPoints=as.integer(unlist(startPoints)),
endPoints=as.integer(unlist(endPoints)))
After that, aggregate is your friend:
aggregate(endPoints - startPoints ~ group, yy, sum)
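Putting these steps together on the sample data from the question might look like the sketch below. It makes a few assumptions of its own: xx is rebuilt with plain character columns instead of factors, lengths() stands in for sapply(endPoints, length), and the stopifnot() guard and the column renaming are additions for illustration.

```r
# Sample data from the question, rebuilt with character columns (assumption)
xx <- data.frame(
  id = rep(1L, 5),
  group = c("A", "A", "B", "B", "C"),
  startPoints = c("4,20,50,63,", "120,300,", "500,", "650,800,", "830,900,950,"),
  endPoints   = c("8,25,60,78", "231,332,", "550,", "700,820,", "850,920,970,"),
  stringsAsFactors = FALSE
)

# strsplit() drops the empty piece after a trailing comma,
# so "120,300," yields two elements, not three
startPoints <- strsplit(xx$startPoints, ",", fixed = TRUE)
endPoints   <- strsplit(xx$endPoints, ",", fixed = TRUE)
len <- lengths(endPoints)

# Guard (an addition): every row needs as many start points as end points
stopifnot(identical(len, lengths(startPoints)))

# Repeat each original row once per interval, then attach the unlisted values
yy <- cbind(xx[rep(seq_len(nrow(xx)), len), c("id", "group")],
            startPoints = as.integer(unlist(startPoints)),
            endPoints   = as.integer(unlist(endPoints)))

# Sum of interval lengths per group
res <- aggregate(endPoints - startPoints ~ group, yy, sum)
names(res) <- c("Group", "SUM")
print(res)
```

With this data, res holds 177 for A, 120 for B and 60 for C, the table asked for in the question; replacing sum with mean gives per-group means in the same way.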