Frequency of non-unique instances when conditioning on groups within and across data frames

Time: 2013-04-17 01:38:53

Tags: r data-binding unique aggregation conditional-operator

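I am analyzing employment data that records which firm each individual worked for in a given year. Each year is a separate data frame.

I want to be able to quickly identify individuals who worked for more than one firm within a given year, as well as individuals who worked for more than one firm across years. My goal is to count the number of "exits" (an employee changing firms) that a given firm has within a year (a single data frame) and across years.

The data frames are structured as follows: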

year1 <- data.frame(individual=c("1", "2", "3", "4", "2", "6", "7", "3", "9", "10"),
                firm=c("A", "B", "C", "D", "A", "C", "D", "B", "B", "C"))

year2 <- data.frame(individual=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
                firm=c("A", "B", "D", "D", "A", "C", "D", "A", "B", "C"))

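I am fairly sure how to do this within a single year, by searching for all non-unique associations between individuals and firms, but I am at a loss as to how to do it across multiple data objects/years. Again, I am interested in the frequency of "exits" per firm, not in specific individuals.

My ideal output would be a frequency/proportion relative to each firm's total number of employees, like this: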

exit(withinyear)_byfirm
exit(betweenyear)_byfirm

1 Answer:

Answer 0: (score: 0)

Counts, not proportions:

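within <- function(y) {
  # Note: this name masks base::within(); it is kept so the calls below match.
  # Assumes y$firm is a factor, as data.frame() produced by default before
  # R 4.0.0 (on newer versions, build the data with stringsAsFactors = TRUE).
  # A vector of length > 1 in the aggregate function means that the person has
  # changed jobs; head(z, -1) keeps every firm except the last, i.e. the firm exited.
  # `[` ignores the value 0 if there are other values present, and returns a
  # zero-length vector if not.  Often a source of confusion, but perfect here.
  exits <- aggregate(firm ~ individual, data = y,
                     FUN = function(x) {
                       z <- unique(x)
                       if (length(z) > 1) head(z, -1) else 0
                     })$firm
  table(levels(y$firm)[exits])
}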
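On R 4.0.0 or later, `data.frame()` no longer creates factors by default, so the factor-level lookup above needs extra care. The same within-year count could be sketched without factors roughly as follows. The name `count_exits_within` is hypothetical and not part of the original answer. The sketch assumes that rows are in chronological order and that every firm except an individual's last one counts as one exit.

```r
# Sample data from the question (year 1), stored as plain character columns.
year1 <- data.frame(
  individual = c("1", "2", "3", "4", "2", "6", "7", "3", "9", "10"),
  firm       = c("A", "B", "C", "D", "A", "C", "D", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count within-year exits per firm.
# Assumption: rows are in chronological order within each individual.
count_exits_within <- function(y) {
  # One character vector of firms per individual, in row order.
  firms_by_person <- split(as.character(y$firm), y$individual)
  exited <- unlist(lapply(firms_by_person, function(f) {
    z <- unique(f)
    head(z, -1)  # every firm except the last one was exited
  }), use.names = FALSE)
  table(exited)
}

count_exits_within(year1)
```

Because `head(z, -1)` returns an empty vector for individuals with a single firm, no placeholder value such as 0 is needed.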

between <- function(year1, year2) {
  # Last firm each individual worked for in year1
  y1 <- do.call(rbind, by(year1, year1$individual, FUN = tail, 1))

  # First firm each individual worked for in year2
  y2 <- do.call(rbind, by(year2, year2$individual, FUN = head, 1))

  # Stack the two and reuse within(): an individual whose two rows name
  # different firms changed firms between the years
  y <- rbind(y1, y2)
  within(y)
}
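The between-year step can also be written as an explicit join, which may be easier to follow. The following is a minimal sketch of that alternative, not the answer's own method. The helper name `count_exits_between` and the `_y1`/`_y2` column suffixes are made up for illustration, and rows are again assumed to be in chronological order within each year.

```r
# Sample data from the question, stored as plain character columns.
year1 <- data.frame(
  individual = c("1", "2", "3", "4", "2", "6", "7", "3", "9", "10"),
  firm       = c("A", "B", "C", "D", "A", "C", "D", "B", "B", "C"),
  stringsAsFactors = FALSE
)
year2 <- data.frame(
  individual = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
  firm       = c("A", "B", "D", "D", "A", "C", "D", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count exits per firm between two consecutive years.
count_exits_between <- function(y1, y2) {
  # Last row per individual in the first year.
  last_y1 <- y1[!duplicated(y1$individual, fromLast = TRUE), ]
  # First row per individual in the second year.
  first_y2 <- y2[!duplicated(y2$individual), ]
  # Inner join: only individuals present in both years can be compared.
  both <- merge(last_y1, first_y2, by = "individual",
                suffixes = c("_y1", "_y2"))
  from <- as.character(both$firm_y1)
  to   <- as.character(both$firm_y2)
  # Tabulate the firm that was left whenever the two firms differ.
  table(from[from != to])
}

count_exits_between(year1, year2)
```

Individuals who appear in only one of the two years drop out of the inner join, so they are not counted as exits.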

Results:

> within(year1)

B C 
1 1 

> within(year2)
character(0)

> between(year1, year2)

A B 
1 1
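
The question asked for proportions of each firm's workforce rather than raw counts. One possible way to get there is to divide the exit counts by a per-firm headcount. This sketch assumes the denominator is the number of distinct individuals recorded at each firm in that year, which the question does not define. The helper name `exit_proportions` is hypothetical, and the exit counts are hard-coded from the `within(year1)` result above.

```r
# Sample data from the question (year 1).
year1 <- data.frame(
  individual = c("1", "2", "3", "4", "2", "6", "7", "3", "9", "10"),
  firm       = c("A", "B", "C", "D", "A", "C", "D", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Exit counts per firm, copied from the within(year1) result.
exits <- c(B = 1, C = 1)

# Hypothetical helper: exits divided by distinct employees per firm.
# Assumption: headcount = number of distinct individuals seen at the firm.
exit_proportions <- function(exits, y) {
  pairs <- unique(y[, c("individual", "firm")])
  headcount <- table(pairs$firm)
  # Start from zero for every firm, then fill in the firms with exits.
  counts <- setNames(numeric(length(headcount)), names(headcount))
  counts[names(exits)] <- exits
  counts / as.vector(headcount)
}

exit_proportions(exits, year1)
```

With the sample data, firms B and C each have three distinct employees and one exit, so their proportion is one third. Firms A and D have no exits and come out as zero.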