I have a data frame like this:
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))
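For each id, I want my code to select the torre that is repeated most often, or the last torre for that id if no value is repeated more often than the others, so that I get a new data frame like this: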
df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))
Answer 0 (score: 2)
You can do this with aggregate:
aggregate(torre ~ id, data=df,
FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1))
)
A full explanation of this call is somewhat involved, but most of the work is done by the FUN= argument. Here we define a function that gets the frequency count of each torre, sorts the counts in increasing order, then uses tail(, 1) to take the last count and returns its name. The aggregate() function then applies this function separately to each id.
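To make the role of FUN= more concrete, the following sketch traces the same steps by hand for a single id. The intermediate names (x, f, counts, sorted, pick, x4, pick4) are introduced here for illustration only and are not part of the original answer. The tie case relies on sort() keeping tied counts in their existing order, which is how it behaves on this data.

```r
# Example data from the question
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
                 torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))

# Values of torre for one id (id 3), to trace what FUN does
x <- df$torre[df$id == 3]

f <- factor(x, levels = unique(x))   # levels kept in order of first appearance
counts <- table(f)                   # frequency count of each torre
sorted <- sort(counts)               # increasing order of frequency
pick <- names(tail(sorted, 1))       # name of the last (most frequent) entry

# id 4 has no repeated value, so every count is tied at 1
x4 <- df$torre[df$id == 4]
pick4 <- names(tail(sort(table(factor(x4, levels = unique(x4)))), 1))

print(pick)
print(pick4)
```

Setting levels = unique(x) is the design choice that matters here: without it the levels would be ordered alphabetically, and a tie would be resolved by alphabetical order rather than by order of appearance.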
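Answer 1 (score: 1)
You can do this with the dplyr package: group by id and torre to count the occurrences of each torre/id combination, then group by id only and select the last-occurring torre among those with the highest frequency within the group.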
library(dplyr)
df %>%
group_by(id,torre) %>%
mutate(n=n()) %>%
group_by(id) %>%
filter(n==max(n)) %>%
slice(n()) %>%
select(-n)
id torre
<dbl> <chr>
1 1 a
2 2 a
3 3 q
4 4 c
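As a cross-check, the same count, filter and take-last logic can be sketched in base R without dplyr. This is a swapped-in technique built on ave() and duplicated(), not the answer's own code. The helper columns n and max_n and the names top and res are hypothetical names chosen for this sketch.

```r
# Example data from the question
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
                 torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))

# Count of each torre within its id (plays the role of mutate(n = n()))
df$n <- ave(seq_along(df$torre), df$id, df$torre, FUN = length)

# Highest count within each id
df$max_n <- ave(df$n, df$id, FUN = max)

# Keep the rows that reach the maximum (like filter(n == max(n))),
# then keep only the last such row per id (like slice(n()))
top <- df[df$n == df$max_n, ]
res <- top[!duplicated(top$id, fromLast = TRUE), c("id", "torre")]
rownames(res) <- NULL

print(res)
```

On the example data this should reproduce the data frame df2 requested in the question.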
Answer 2 (score: 1)
An approach using the data.table package:
library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]
which gives:
id torre
1: 1 a
2: 2 a
3: 3 q
4: 4 c
The answer also refers to two possible dplyr alternatives, but their code is not included here.
Answer 3 (score: 1)
Yet another dplyr solution, this time using add_count() instead of mutate():
df %>%
add_count(id, torre) %>%
group_by(id) %>%
filter(n==max(n)) %>%
slice(n()) %>%
select(-n)
# A tibble: 4 x 2
# Groups: id [4]
id torre
<dbl> <fct>
1 1. a
2 2. a
3 3. q
4 4. c