I have a data frame like this:
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))
|ID | time| intensity|
|:--|----:|---------:|
|A  |  3.1|        10|
|A  |  3.2|        20|
|B  |  6.5|        30|
|B  | 12.3|        40|
|C  |  3.2|        50|
|C  |  3.4|        60|
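I want to aggregate the values by ID (summing the intensities), but only when the time difference within an ID is less than 0.3. First, I calculated this time difference: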
library(dplyr)
df.2 <- df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time))
...which results in:
|ID | time| intensity| time.diff|
|:--|----:|---------:|---------:|
|A  |  3.1|        10|       0.1|
|A  |  3.2|        20|       0.1|
|B  |  6.5|        30|       5.8|
|B  | 12.3|        40|       5.8|
|C  |  3.2|        50|       0.2|
|C  |  3.4|        60|       0.2|
To be clear, the result I would like to get is:
|ID | time| intensity| time.diff|
|:--|----:|---------:|---------:|
|A  | 3.15|        30|       0.1|
|B  |  6.5|        30|       5.8|
|B  | 12.3|        40|       5.8|
|C  |  3.3|       110|       0.2|
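Here, time is the mean of the combined observations and intensity is their sum. ID "B" keeps both observations because its time difference is greater than 0.3. I have tried using dplyr, but summarise always drops one of the "B" observations, which I want to keep, and I don't know how to do a conditional _group_by_.
I appreciate any ideas!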
Answer 0 (score: 3)
Using data.table:
library(data.table)
unique(setDT(df)[, time.diff := max(time) - min(time), ID][
  time.diff <= 0.3, c('time', 'intensity') := list(mean(time),
                                                   sum(intensity)), ID])
#    ID  time intensity time.diff
# 1:  A  3.15        30       0.1
# 2:  B  6.50        30       5.8
# 3:  B 12.30        40       5.8
# 4:  C  3.30       110       0.2
Or using dplyr:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time), indx = all(time.diff <= 0.3),
         intensity = ifelse(indx, sum(intensity), intensity),
         time = ifelse(indx, mean(time), time)) %>%
  filter(!indx | row_number() == 1) %>%
  select(-indx)
#   ID  time intensity time.diff
# 1  A  3.15        30       0.1
# 2  B  6.50        30       5.8
# 3  B 12.30        40       5.8
# 4  C  3.30       110       0.2
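Since the question asks specifically about a conditional group_by, here is a minimal sketch of that idea, not a definitive implementation. It builds a helper grouping key so that rows of a "close" ID share one key while rows of any other ID each get their own key, and then summarises on that key. The function name `aggregate_close`, its `threshold` argument, and the helper column `grp` are hypothetical names introduced for illustration, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

# Hypothetical helper: collapse an ID only when its time range is <= threshold
aggregate_close <- function(data, threshold = 0.3) {
  data %>%
    group_by(ID) %>%
    mutate(time.diff = max(time) - min(time),
           # one shared key for "close" IDs, one key per row otherwise
           grp = if (time.diff[1] <= threshold) 1L else row_number()) %>%
    group_by(ID, grp, time.diff) %>%
    summarise(time = mean(time), intensity = sum(intensity),
              .groups = "drop") %>%
    select(ID, time, intensity, time.diff)
}

result <- aggregate_close(df)
print(result)
```

Groups that contain a single row pass through `mean()` and `sum()` unchanged, which is why both "B" observations survive the summarise step.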
Answer 1 (score: 3)
Another variant of the data.table solution:
library(data.table)
setDT(df)[, time.diff := max(time) - min(time), by = ID
        ][, if (time.diff <= 0.3)
              .(time = mean(time), intensity = sum(intensity))
            else .SD, by = .(ID, time.diff)]
# ID time.diff time intensity
# 1: A 0.1 3.15 30
# 2: B 5.8 6.50 30
# 3: B 5.8 12.30 40
# 4: C 0.2 3.30 110
Answer 2 (score: 1)
# get time.diff
df$time.diff <- ave(x = df$time,df$ID,FUN = function(x){max(x)-min(x)})
# new split variable to use with ID
df$cut <- cumsum(df$time.diff > .3)
# aggregate everything you need and ignore the cut variable
require(plyr)
ddply(df,c('cut','ID'),summarize,
time = mean(time),
intensity = sum(intensity),
time.diff = mean(time.diff))[2:5]
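To see why the `cut` variable works, it may help to look at its values on the sample data. The sketch below only reproduces the first two steps of the code above in base R and prints the intermediate columns.

```r
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

# per-ID time range, repeated on every row of that ID
df$time.diff <- ave(x = df$time, df$ID, FUN = function(x) max(x) - min(x))

# the counter increases on every row whose ID has a range above 0.3,
# so each such row ends up with its own value
df$cut <- cumsum(df$time.diff > 0.3)

print(df[, c("ID", "time.diff", "cut")])
```

On this data `cut` is 0, 0, 1, 2, 2, 2. The second "B" row and both "C" rows share the value 2, which is why the aggregation groups by both `cut` and `ID`. Note that this relies on the rows of each ID being adjacent: if the rows of a "close" ID were interleaved with rows of a widely spread ID, they would receive different `cut` values and would not be merged.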
Answer 3 (score: 1)
Using sqldf:
library(sqldf)
sqldf('SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif FROM df
GROUP BY ID
HAVING (MAX(time)-MIN(time))<0.3
UNION
SELECT ID, df.time, df.intensity, df2.dif
FROM (SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif
FROM df
GROUP BY ID
HAVING (MAX(time)-MIN(time))>=0.3) as df2
LEFT JOIN df USING (ID)')