Multiplying grouped data frames by a matrix with dplyr

Time: 2017-07-20 06:38:23

Tags: r data.table dplyr

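My question:

I have two data frames, one for industries and one for occupations. Both are nested by state and show employment.

I also have a concordance matrix that gives the weight of each occupation within each industry.

I want to use industry employment and the concordance matrix to create a new employment figure in the occupation data frame.

I have made a dummy version of my problem, which I think makes it clear:

Update

I have solved the problem, but I would like to know whether there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes quite memory-hungry.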

# load the packages that provide %>%, group_by, gather and left_join
library(dplyr)
library(tidyr)

# create industry data frame

set.seed(12345)

ind_df <- data.frame(State = c(rep("a", len =6),rep("b", len =6),rep("c", len =6)),
                 industry = rep(c("Ind1","Ind2","Ind3","Ind4","Ind5","Ind6"), len = 18),
                 emp = rnorm(18,20,2))


# create occupation data frame

Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len =5)),
                     occupation = rep(c("Occ1","Occ2","Occ3","Occ4","Occ5"), len = 15),
                     emp = rnorm(15,10,1))

# create concordance matrix

Ind_Occ_Conc <- matrix(rnorm(6*5,1,0.5),6,5) %>% as.data.frame()

# name cols (occupations) and rows (industries) in the concordance matrix

colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)



# solution 

Ind_combined <- cbind(Ind_Occ_Conc, ind_df)

Ind_combined <- Ind_combined %>%
  group_by(State) %>% 
  mutate(Occ1 = emp*Occ1,
         Occ2 = emp*Occ2,
         Occ3 = emp*Occ3,
         Occ4 = emp*Occ4,
         Occ5 = emp*Occ5
         )

Ind_combined <- Ind_combined %>% 
  gather(key = "occupation",
         value = "emp2",
         -State,
         -industry,
         -emp
         )

Ind_combined <- Ind_combined %>%
  group_by(State, occupation) %>%
  summarise(emp2 = sum(emp2))


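# join on both keys explicitly instead of relying on the natural join
Occ_df <- left_join(Occ_df, Ind_combined, by = c("State", "occupation"))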

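My solution looks quite inefficient. Is there a better or faster way?

Also, I'm not quite sure how to get there, but the expected result is another column added to Occ_df, called emp2, derived from the emp column of ind_df and from Ind_Occ_Conc. I tried to do this for Occ1: essentially, Ind_Occ_Conc contains the weights and the result is a weighted average.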
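To make the Occ1 case concrete, here is a minimal sketch of the calculation for a single state and a single occupation. The numbers and the names emp_a and occ1_weights are made up for illustration and are not taken from the dummy data above.

```r
# Hypothetical toy numbers, not the simulated data above
emp_a <- c(Ind1 = 10, Ind2 = 20, Ind3 = 30)         # industry employment in one state
occ1_weights <- c(Ind1 = 0.5, Ind2 = 1, Ind3 = 2)   # Occ1 column of the concordance matrix

# emp2 for Occ1 in this state: weight each industry's employment, then sum
emp2_occ1 <- sum(emp_a * occ1_weights)
emp2_occ1
```

Strictly speaking this is a weighted sum; dividing by sum(occ1_weights) would turn it into a weighted average.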

1 Answer:

Answer 0: (score: 0)

I'm not sure what you want to do with the sum(Ind$emp * Occ1_coeff) line, but maybe this is what you are looking for:

# Instead of doing the computation only for state a, get expected outcomes for all states (with dplyr):
Ind <- ind_df %>% group_by(State) %>%
        summarize(rez = sum(emp))

# Then do some computations on Ind, which has one row per state (column rez)
# ...

# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)

The final output will have the Ind values in a new column: one value for all rows of a, one for b, and one for c.

Hope it helps ;)
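If the goal is instead one weighted sum per state and occupation, one possible alternative is a single matrix product. This is only a sketch under assumptions: every state has a row for every industry, the concordance matrix has industries in rows and occupations in columns, and the toy_* names and values below are hypothetical. It uses base R only.

```r
# Toy data with hypothetical values
toy_ind <- data.frame(
  State    = rep(c("a", "b"), each = 3),
  industry = rep(c("Ind1", "Ind2", "Ind3"), times = 2),
  emp      = c(10, 20, 30, 1, 2, 3)
)
toy_occ <- data.frame(
  State      = rep(c("a", "b"), each = 2),
  occupation = rep(c("Occ1", "Occ2"), times = 2),
  emp        = c(5, 6, 7, 8)
)

# Concordance matrix: industries in rows, occupations in columns
# (matrix() fills column by column, so Occ1 = 1, 0, 2 and Occ2 = 0, 1, 1)
toy_conc <- matrix(c(1, 0, 2,
                     0, 1, 1),
                   nrow = 3,
                   dimnames = list(c("Ind1", "Ind2", "Ind3"),
                                   c("Occ1", "Occ2")))

# State x industry employment matrix
emp_mat <- tapply(toy_ind$emp,
                  list(State = toy_ind$State, industry = toy_ind$industry),
                  sum)

# Align industry columns with the concordance rows, then multiply:
# the result is a State x occupation matrix of weighted sums
emp2_mat <- emp_mat[, rownames(toy_conc), drop = FALSE] %*% toy_conc

# Back to long format so it can be joined onto the occupation data frame
emp2_long <- as.data.frame(as.table(emp2_mat))
names(emp2_long) <- c("State", "occupation", "emp2")

toy_occ <- merge(toy_occ, emp2_long,
                 by = c("State", "occupation"), all.x = TRUE)
toy_occ
```

The intermediate object here is states by occupations rather than states by industries by occupations, which may help with memory at larger dimensions, though that has not been measured on the real data.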