Question

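I have a panel (cross-sectional time series) dataset. For each group (defined by (NAICS2, occ_type) over time ym), I have many variables. For each variable, I want to subtract the group's first (dplyr::first) value from every value in that group. Ultimately, I am trying to get the Euclidean distance between each row's vector and the vector of its group's first entry (i.e. sqrt(c_1^2 + ... + c_k^2), with c_j the difference in the j-th variable).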
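To pin down the quantity, here it is written out. The symbols are my own notation and do not appear in the data: x_{t,j} is the value of the j-th `_scf` variable in the group's row at time t, and t = 1 is the group's first row.

```latex
c_{t,j} = x_{t,j} - x_{1,j}, \qquad
d_t = \sqrt{\sum_{j=1}^{k} c_{t,j}^{2}}
    = \sqrt{\sum_{j=1}^{k} \left( x_{t,j} - x_{1,j} \right)^{2}}
```

By construction d_1 = 0 for every group.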

I was able to create a column equal to each group's first entry:

df2 <- df %>% 
  group_by(ym, NAICS2, occ_type) %>% 
  distinct(ym, NAICS2, occ_type, .keep_all = T) %>% 
  arrange(occ_type, NAICS2, ym) %>% 
  select(group_cols(), ends_with("_scf")) %>% 
  mutate_at(vars(-group_cols(), ends_with("_scf")), 
            list(first = dplyr::first))

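I then tried including variants of f.diff = . - dplyr::first(.) in the list, but none of them worked. I have searched for a while on the dot notation, and on first and lag for time series in dplyr, but have not been able to solve it.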

Ideally, I would first unite all the variables into a single vector for each row and then take the sum.

df2 <- df %>% 
  group_by(ym, NAICS2, occ_type) %>% 
  distinct(ym, NAICS2, occ_type, .keep_all = T) %>% 
  arrange(occ_type, NAICS2, ym) %>% 
  select(group_cols(), ends_with("_scf")) %>% 
  unite(vector, c(-group_cols(), ends_with("_scf")), sep = ',') %>%
# TODO: DISTANCE_BETWEEN_ENTRY_AND_FIRST
  mutate(vector.diff = ???)

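My desired output is a numeric column containing a distance measure: the difference between each group's row vector and that group's initial row vector.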
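For concreteness, here is a minimal sketch of the computation on toy data. It is an illustration under assumptions, not a confirmed solution. It assumes the groups are (NAICS2, occ_type) followed over ym: with ym included in group_by(), every group in the sample data is a single row, so the first value would always be the row itself. It also assumes dplyr 1.0.0 or later, because it uses across(). The toy columns a_scf and b_scf and the names diffs, dist_df and dist_from_first are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values): two groups, two months each
df <- tibble::tibble(
  ym       = rep(c("2007-01-01", "2007-02-01"), times = 2),
  NAICS2   = rep(c(0L, 11L), each = 2),
  occ_type = "is_middle_manager",
  a_scf    = c(3, 0, 10, 10),
  b_scf    = c(4, 0, 5, 8)
)

# Step 1: within each (NAICS2, occ_type) group, ordered by ym,
# subtract the group's first value from every "_scf" column.
diffs <- df %>%
  ungroup() %>%
  arrange(NAICS2, occ_type, ym) %>%
  group_by(NAICS2, occ_type) %>%   # assumption: ym is not a grouping variable
  mutate(across(ends_with("_scf"), ~ .x - dplyr::first(.x))) %>%
  ungroup()

# Step 2: Euclidean norm of each row's difference vector
dist_df <- diffs %>%
  mutate(
    dist_from_first = sqrt(rowSums(as.matrix(select(diffs, ends_with("_scf")))^2))
  )

dist_df$dist_from_first
```

Working on the numeric columns directly avoids unite(), which would paste the values into a single character string that has to be parsed again before any arithmetic.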

Here is a sample of the data:

structure(list(ym = c("2007-01-01", "2007-02-01"), NAICS2 = c(0L, 
0L), occ_type = c("is_middle_manager", "is_middle_manager"), 
    Administration_scf = c(344, 250), Agriculture..Horticulture..and.the.Outdoors_scf = c(11, 
    17), Analysis_scf = c(50, 36), Architecture.and.Construction_scf = c(57, 
    51), Business_scf = c(872, 585), Customer.and.Client.Support_scf = c(302, 
    163), Design_scf = c(22, 17), Economics..Policy..and.Social.Studies_scf = c(7, 
    7), Education.and.Training_scf = c(77, 49), Energy.and.Utilities_scf = c(25, 
    28), Engineering_scf = c(90, 64), Environment_scf = c(19, 
    19), Finance_scf = c(455, 313), Health.Care_scf = c(105, 
    71), Human.Resources_scf = c(163, 124), Industry.Knowledge_scf = c(265, 
    174), Information.Technology_scf = c(467, 402), Legal_scf = c(21, 
    17), Maintenance..Repair..and.Installation_scf = c(194, 222
    ), Manufacturing.and.Production_scf = c(176, 174), Marketing.and.Public.Relations_scf = c(139, 
    109), Media.and.Writing_scf = c(18, 20), Personal.Care.and.Services_scf = c(31, 
    16), Public.Safety.and.National.Security_scf = c(14, 7), 
    Religion_scf = c(0, 0), Sales_scf = c(785, 463), Science.and.Research_scf = c(52, 
    24), Supply.Chain.and.Logistics_scf = c(838, 455), total_scf = c(5599, 
    3877)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), groups = structure(list(ym = c("2007-01-01", 
"2007-02-01"), NAICS2 = c(0L, 0L), occ_type = c("is_middle_manager", 
"is_middle_manager"), .rows = list(1L, 2L)), row.names = c(NA, 
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))

How do I create differences between several pairs of columns?

0 answers: