Frequency count across four columns in R

Time: 2017-11-16 15:58:13

Tags: r merge count

I am currently trying to count how often each row combination of one data frame occurs in another data frame.

A  B
1  a
1  b
1  c
2  a
2  b
2  c

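I have the data frame above, and I want to count how often each of its "B" values (within each "A") occurs in another data frame, which looks like this: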

C  D
1  a
1  a
1  b
1  b
2  b
2  c
2  c

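As you can see, the two data frames have different numbers of rows, so data.table (count) does not work. After the frequency count, I want the result to look like this: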

a  b  freq
1  a   2
1  b   2
1  c   0
2  a   0
2  b   1
2  c   2

As you can see, every combination gets a frequency, including 0 for groups that have no data.

Thanks to anyone who can help!

3 answers:

Answer 0 (score: 2)

Use merge and aggregate:

df2$freq = 1
df = merge(df1,aggregate(freq~.,df2,length),by.x = c('A','B'),by.y = c('C','D'),all.x = TRUE)
df[is.na(df)] = 0
df
  A B freq
1 1 a    2
2 1 b    2
3 1 c    0
4 2 a    0
5 2 b    1
6 2 c    2

More info: the intermediate result of aggregate

aggregate(freq~.,df2,length)
  C D freq
1 1 a    2
2 1 b    2
3 2 b    1
4 2 c    2

Data input

df1
  A B
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c

df2
  C D
1 1 a
2 1 a
3 1 b
4 1 b
5 2 b
6 2 c
7 2 c
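
For anyone who wants to run the merge and aggregate approach from scratch, here is a minimal self-contained sketch. The data construction is copied from the example values shown in this thread. The names `counts` and `res` are only illustrative and not part of the original answer. Unlike the original, it replaces NA only in the freq column, which is a cautious choice in case other columns contain legitimate NA values.

```r
# Rebuild the example data (values taken from the question)
df1 <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                  B = c("a", "b", "c", "a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                  D = c("a", "a", "b", "b", "b", "c", "c"),
                  stringsAsFactors = FALSE)

# Helper column of ones; its length per (C, D) group is the count
df2$freq <- 1
counts <- aggregate(freq ~ ., data = df2, FUN = length)

# Left join: keep every row of df1, unmatched pairs get NA in freq
res <- merge(df1, counts,
             by.x = c("A", "B"), by.y = c("C", "D"),
             all.x = TRUE)

# Turn the NA counts into zeros (freq column only)
res$freq[is.na(res$freq)] <- 0
res
```

Note that merge sorts the result by the key columns by default, which is why the rows come back in the same order as df1 here.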

Answer 1 (score: 1)

df1_rows = Reduce(paste, df1)
df2_rows = Reduce(paste, df2)    
data.frame(df1, freq = sapply(df1_rows, function(x) sum(df2_rows %in% x)),
           row.names = NULL)
#  A B freq
#1 1 a    2
#2 1 b    2
#3 1 c    0
#4 2 a    0
#5 2 b    1
#6 2 c    2

Data

df1 = data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                 B = c("a", "b", "c", "a", "b", "c"))

df2 = data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                 D = c("a", "a", "b", "b", "b", "c", "c"))
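
The row-key idea above can also be written without the sapply loop. The following is only a sketch of one possible variant, not the answerer's code: it builds one string key per row with do.call(paste, ...) and counts matches with match and tabulate. The names `key1`, `key2`, and `res`, and the choice of separator, are assumptions made for illustration. It also assumes the rows of df1 are unique, since match only reports the first matching position.

```r
df1 <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                  B = c("a", "b", "c", "a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                  D = c("a", "a", "b", "b", "b", "c", "c"),
                  stringsAsFactors = FALSE)

# One string key per row; a separator unlikely to occur in the data
# reduces the risk of two different rows producing the same key
key1 <- do.call(paste, c(df1, sep = "\r"))
key2 <- do.call(paste, c(df2, sep = "\r"))

# match() gives, for each df2 row, the position of its key in df1
# (NA if absent); tabulate() counts positions and ignores NA
freq <- tabulate(match(key2, key1), nbins = length(key1))

res <- data.frame(df1, freq = freq)
res
```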

Answer 2 (score: 1)

This looks like a question of how to tabulate frequencies over two factors without dropping the missing level combinations.

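Here is a dplyr solution. It assumes that dfAB, like the example data, contains no duplicates (dfAB is interchangeable with the output of expand.grid if you do not already have a data frame of the level combinations).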

library(dplyr)
dfAB %>%
  # need at least one non-joining variable to tell matches from non-matches 
  left_join(mutate(dfCD, dummy = 1), by = c("A" = "C", "B" = "D")) %>% 
  group_by(A, B) %>%
  summarize(freq = sum(dummy, na.rm = TRUE))

Output:

# A tibble: 6 x 3
# Groups:   A [?]
      A     B  freq
  <dbl> <chr> <dbl>
1     1     a     2
2     1     b     2
3     1     c     0
4     2     a     0
5     2     b     1
6     2     c     2

(If there are duplicates in dfAB, add a call to distinct to the chain before the join.)
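
The same "keep the missing levels" idea can be sketched in base R with table, by declaring the factor levels up front so that combinations with no data still get a cell. This is an illustrative alternative, not part of the answer above. The name `tab` is made up, and the sketch assumes the wanted output is the full grid of A and B levels (which is what df1 is in the example, and what expand.grid would produce).

```r
df1 <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                  B = c("a", "b", "c", "a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                  D = c("a", "a", "b", "b", "b", "c", "c"),
                  stringsAsFactors = FALSE)

# Fix the levels from df1 so empty combinations are counted as 0
tab <- table(A = factor(df2$C, levels = sort(unique(df1$A))),
             B = factor(df2$D, levels = sort(unique(df1$B))))

# Long format: one row per (A, B) cell with its count
tab <- as.data.frame(tab, responseName = "freq",
                     stringsAsFactors = FALSE)

# as.data.frame varies A fastest, so reorder to match df1
tab <- tab[order(tab$A, tab$B), ]
rownames(tab) <- NULL
tab
```

One difference to be aware of: with stringsAsFactors = FALSE, column A comes back as character ("1", "2") rather than integer, so convert it if the type matters downstream.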