Question

I have an Excel spreadsheet containing the number of employees in each industry for every county in the US.

It looks like this:

County   Industry  Employees
a        1         49
a        2         1
b        1         4
b        2         19
...

I want to compute the Herfindahl-Hirschman index (HHI) of employment for each county. I'm using R. Given a vector of numbers, computing the HHI is easy:

hhi <- function(x) {
  # total employment across all industries
  total <- sum(x)

  # each industry's share of the total, in percent
  share <- x * 100 / total

  # HHI is the sum of the squared percentage shares
  sum(share^2)
}

So, for example, the HHI for county a is 9608 (= 98^2 + 2^2), and for county b it is 7127.
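
As a quick check of those two figures, the function can be run on each county's counts directly. This is only a sketch: county_a and county_b are hypothetical names for vectors typed in by hand from the sample rows above, and hhi is repeated so the block runs on its own.

```r
# Same function as above, condensed
hhi <- function(x) {
  share <- x * 100 / sum(x)  # percentage share of each industry
  sum(share^2)               # sum of squared shares
}

# Employee counts per industry, typed in from the sample table (assumed names)
county_a <- c(49, 1)
county_b <- c(4, 19)

hhi(county_a)  # 9608
hhi(county_b)  # about 7126.65
```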

But how do I create a new column containing each county's HHI?

Answer 1

You can use dplyr:

library(dplyr)
df %>% group_by(County) %>% mutate(HHI = sum((Employees/sum(Employees) * 100)^2))

# Source: local data frame [4 x 4]
# Groups: County [2]

#   County Industry Employees      HHI
#   <fctr>    <int>     <int>    <dbl>
# 1      a        1        49 9608.000
# 2      a        2         1 9608.000
# 3      b        1         4 7126.654
# 4      b        2        19 7126.654

Or equivalently, with data.table:

library(data.table)
setDT(df)[, HHI := sum((Employees/sum(Employees) * 100)^2), by = County][]

Because every function your custom hhi calls is vectorized, you can also use it directly with mutate:

df %>% group_by(County) %>% mutate(HHI = hhi(Employees))

Or:

setDT(df)[, HHI := hhi(Employees), County][]

Answer 2

We can use ave from base R (no packages needed):

df1$HHI <- with(df1, ave(Employees, County, FUN = hhi))
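
For a self-contained run, here is a minimal sketch. It assumes the sample rows from the question are typed in by hand as df1 (in practice they would be read from the Excel sheet), and county_hhi is just an illustrative name. hhi is the asker's function, repeated in condensed form so the block runs on its own.

```r
# hhi() as defined in the question, condensed
hhi <- function(x) {
  share <- x * 100 / sum(x)
  sum(share^2)
}

# Sample rows from the question, typed in by hand (assumed construction)
df1 <- data.frame(
  County    = c("a", "a", "b", "b"),
  Industry  = c(1L, 2L, 1L, 2L),
  Employees = c(49, 1, 4, 19)
)

# ave() calls hhi on the Employees of each County and recycles the
# single result across all rows of that county
df1$HHI <- with(df1, ave(Employees, County, FUN = hhi))
df1

# Optional: one value per county instead of one per county-industry row
county_hhi <- tapply(df1$Employees, df1$County, hhi)
county_hhi
```

The tapply call at the end is optional; it is handy when a single HHI per county is wanted rather than the value repeated on every row.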
