Select the row with the most matches per ID

Time: 2018-08-21 18:35:32

Tags: r dataframe aggregate

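I have a data frame like this:

df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
                 torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))

For each id, I want my code to select the torre that is repeated most often, or the last torre for that id if none is repeated more often than the others, so that I get a new data frame like this: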

df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))

4 answers:

Answer 0: (score: 2)

You can use aggregate:

aggregate(torre ~ id, data=df, FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1)) )

A full explanation of this is somewhat involved, but most of the work is done by the FUN= argument. Here we define a function that gets the frequency count of each torre, sorts the counts in increasing order, then uses tail(, 1) to take the last one and returns its name. Setting the factor levels to unique(x) keeps tied values in their order of first appearance, so a tie resolves to the value that first appears latest. aggregate() then applies this function to each id separately.
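To see what the function passed to FUN= does, it can help to run its steps one at a time on a single id. The sketch below does this for the torre values of id 3 and id 4 from the question. The helper name most_frequent_last is hypothetical and only wraps the same expression for illustration.

```r
# Hypothetical helper wrapping the expression used in FUN= above
most_frequent_last <- function(x) {
  # Levels in order of first appearance, so ties keep that order when sorted
  f <- factor(x, levels = unique(x))
  counts <- table(f)        # frequency of each distinct value
  sorted <- sort(counts)    # increasing counts; tied counts keep level order
  names(tail(sorted, 1))    # name of the last, i.e. largest, count
}

id3 <- most_frequent_last(c("q", "t", "q", "g"))  # "q" occurs twice
id4 <- most_frequent_last(c("a", "b", "c"))       # all tied, last value wins
print(c(id3, id4))
```

Wrapping the expression in a named function also makes the aggregate() call shorter: aggregate(torre ~ id, data=df, FUN=most_frequent_last).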

Answer 1: (score: 1)

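You can do this with the dplyr package: group by id and torre to count the occurrences of each torre/id combination, then group by id only and select the last occurrence of the torre with the highest frequency within the group.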

library(dplyr)
df %>%
  group_by(id, torre) %>%
  mutate(n = n()) %>%
  group_by(id) %>%
  filter(n == max(n)) %>%
  slice(n()) %>%
  select(-n)
     id torre
  <dbl> <chr>
1     1     a
2     2     a
3     3     q
4     4     c
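For readers without dplyr, the same three steps (count each id/torre combination, keep the rows with the maximum count per id, take the last of them) can be sketched in base R. This swaps in ave() and duplicated() for the dplyr verbs and is only an illustration, not part of the answer above; the names top and res are arbitrary.

```r
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
                 torre = c("a","a","b","d","a","q","t","q","g","a","b","c"),
                 stringsAsFactors = FALSE)

# Step 1: per-row count of the (id, torre) combination, like mutate(n = n())
df$n <- ave(seq_along(df$id), df$id, df$torre, FUN = length)

# Step 2: keep rows whose count equals the maximum within their id
top <- df[df$n == ave(df$n, df$id, FUN = max), ]

# Step 3: keep the last such row per id, like slice(n())
res <- top[!duplicated(top$id, fromLast = TRUE), c("id", "torre")]
rownames(res) <- NULL
print(res)
```

Note that FUN must be passed by name in ave(), because every unnamed argument after the first is treated as a grouping variable.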

Answer 2: (score: 1)

An approach using the data.table package:

library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]

which gives:

   id torre
1:  1     a
2:  2     a
3:  3     q
4:  4     c

Answer 3: (score: 1)

Here is another dplyr solution, this time using add_count() instead of mutate():

df %>%
  add_count(id, torre) %>% 
  group_by(id) %>% 
  filter(n==max(n)) %>% 
  slice(n()) %>% 
  select(-n)

# A tibble: 4 x 2
# Groups:   id [4]
     id torre
  <dbl> <fct>
1    1. a    
2    2. a    
3    3. q    
4    4. c