How do I create differences between several pairs of columns?

Asked: 2019-04-08 18:44:54

Tags: r dplyr time-series tidyr

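I have a panel (cross-sectional time-series) dataset. For each group (defined by (NAICS2, occ_type) over time ym), I have many variables. For each variable, I want to subtract the group's first (dplyr::first) value from every value in that group. Ultimately, I am trying to get the Euclidean distance between each row's vector and the vector of its group's first entry, i.e. sqrt(c_1^2 + ... + c_k^2), where c_1, ..., c_k are the per-variable differences.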
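Written out, the target quantity might look like the following. The symbols x_{j,t} (value of variable j in period t of a group) and d_t are notation introduced here for illustration; only c_1, ..., c_k and k appear in the question itself.

```latex
% Assumed notation: x_{j,t} is variable j at period t within one group,
% t = 1 is the group's first entry, and k is the number of *_scf variables.
\[
  c_{j,t} = x_{j,t} - x_{j,1},
  \qquad
  d_t = \sqrt{\sum_{j=1}^{k} c_{j,t}^{2}}
      = \sqrt{\sum_{j=1}^{k} \bigl(x_{j,t} - x_{j,1}\bigr)^{2}}
\]
```

Under this reading, d_1 = 0 for every group, since the first row is compared with itself.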

I was able to create columns that equal each group's first entry:

df2 <- df %>% 
  group_by(ym, NAICS2, occ_type) %>% 
  distinct(ym, NAICS2, occ_type, .keep_all = T) %>% 
  arrange(occ_type, NAICS2, ym) %>% 
  select(group_cols(), ends_with("_scf")) %>% 
  mutate_at(vars(-group_cols(), ends_with("_scf")), 
            list(first = dplyr::first))

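I then tried including variants of f.diff = . - dplyr::first(.) in the list, but none of them worked. I spent a while searching on dot notation, and on first and lag for time series in dplyr, but have not been able to solve it.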
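One possible reason such variants fail is that a bare expression like `. - dplyr::first(.)` is not a function; in dplyr 0.8 and later it can be wrapped as a formula lambda with `~`. The sketch below also assumes that the panel unit is (NAICS2, occ_type) only: if ym is part of the grouping, every group has a single row and each difference is zero. The `toy` data frame is a hypothetical two-column subset of the sample data, and the `_diff` suffix is an arbitrary choice.

```r
library(dplyr)

# Hypothetical toy subset of the sample data: one panel unit, two periods.
toy <- data.frame(
  ym       = c("2007-01-01", "2007-02-01"),
  NAICS2   = c(0L, 0L),
  occ_type = c("is_middle_manager", "is_middle_manager"),
  Administration_scf = c(344, 250),
  Analysis_scf       = c(50, 36),
  stringsAsFactors   = FALSE
)

toy_diff <- toy %>%
  # Sort so that the first row of each unit is its earliest period.
  arrange(occ_type, NAICS2, ym) %>%
  # Assumption: group by the panel unit only, NOT by ym.
  group_by(NAICS2, occ_type) %>%
  # The formula lambda makes `.` refer to the current column within the group.
  mutate_at(vars(ends_with("_scf")),
            list(diff = ~ . - dplyr::first(.))) %>%
  ungroup()
```

With a named list, `mutate_at()` appends the name to each column, so this should yield `Administration_scf_diff` and `Analysis_scf_diff` alongside the original columns.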

Ideally, I would first combine all the variables into a single vector for each row and then take the sum.

df2 <- df %>% 
  group_by(ym, NAICS2, occ_type) %>% 
  distinct(ym, NAICS2, occ_type, .keep_all = T) %>% 
  arrange(occ_type, NAICS2, ym) %>% 
  select(group_cols(), ends_with("_scf")) %>% 
  unite(vector, c(-group_cols(), ends_with("_scf")), sep = ',') %>%
# TODO: DISTANCE_BETWEEN_ENTRY_AND_FIRST
  mutate(vector.diff = ???)

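I want the output to be a numeric column containing a distance measure: the difference between each row's vector and the initial row vector of its group.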
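A minimal sketch of one way to get such a column, without `unite()`: keep the variables numeric, subtract each group's first row, and take the row-wise Euclidean norm. It assumes the groups are (NAICS2, occ_type) ordered by ym, and it uses a hypothetical two-variable subset of the sample data (`panel`, `sorted`, and `diffs` are illustrative names).

```r
library(dplyr)

# Hypothetical toy subset of the sample data: one panel unit, two periods.
panel <- data.frame(
  ym       = c("2007-01-01", "2007-02-01"),
  NAICS2   = c(0L, 0L),
  occ_type = c("is_middle_manager", "is_middle_manager"),
  Administration_scf = c(344, 250),
  Analysis_scf       = c(50, 36),
  stringsAsFactors   = FALSE
)

# Sort once so that row order is shared by `sorted` and `diffs`.
sorted <- panel %>% arrange(occ_type, NAICS2, ym)

# Replace every *_scf column by its difference from the group's first row.
diffs <- sorted %>%
  group_by(NAICS2, occ_type) %>%   # assumption: ym is not a grouping variable
  mutate_at(vars(ends_with("_scf")), ~ . - dplyr::first(.)) %>%
  ungroup() %>%
  select(ends_with("_scf"))

# Row-wise Euclidean norm of the difference vectors.
sorted$vector.diff <- sqrt(rowSums(as.matrix(diffs)^2))
```

On the full data, `ends_with("_scf")` would also pick up `total_scf`; if that column is a sum of the others, it probably should be dropped before computing the distance.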

Here is a sample of the data:

structure(list(ym = c("2007-01-01", "2007-02-01"), NAICS2 = c(0L, 
0L), occ_type = c("is_middle_manager", "is_middle_manager"), 
    Administration_scf = c(344, 250), Agriculture..Horticulture..and.the.Outdoors_scf = c(11, 
    17), Analysis_scf = c(50, 36), Architecture.and.Construction_scf = c(57, 
    51), Business_scf = c(872, 585), Customer.and.Client.Support_scf = c(302, 
    163), Design_scf = c(22, 17), Economics..Policy..and.Social.Studies_scf = c(7, 
    7), Education.and.Training_scf = c(77, 49), Energy.and.Utilities_scf = c(25, 
    28), Engineering_scf = c(90, 64), Environment_scf = c(19, 
    19), Finance_scf = c(455, 313), Health.Care_scf = c(105, 
    71), Human.Resources_scf = c(163, 124), Industry.Knowledge_scf = c(265, 
    174), Information.Technology_scf = c(467, 402), Legal_scf = c(21, 
    17), Maintenance..Repair..and.Installation_scf = c(194, 222
    ), Manufacturing.and.Production_scf = c(176, 174), Marketing.and.Public.Relations_scf = c(139, 
    109), Media.and.Writing_scf = c(18, 20), Personal.Care.and.Services_scf = c(31, 
    16), Public.Safety.and.National.Security_scf = c(14, 7), 
    Religion_scf = c(0, 0), Sales_scf = c(785, 463), Science.and.Research_scf = c(52, 
    24), Supply.Chain.and.Logistics_scf = c(838, 455), total_scf = c(5599, 
    3877)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), groups = structure(list(ym = c("2007-01-01", 
"2007-02-01"), NAICS2 = c(0L, 0L), occ_type = c("is_middle_manager", 
"is_middle_manager"), .rows = list(1L, 2L)), row.names = c(NA, 
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))

0 Answers:

No answers