Combining variables in a data.frame

Date: 2019-12-14 06:17:57

Tags: r sum grouping

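I made the following xtable, in which the period of Mesoamerican archaeological sites is a factor with 11 values. I would like to merge some of the periods, so that "CL" = "CL" + "EC" + "LC" and "F" = "MF" + "LF".

I would also like to recode the distance so that it shows kilometres: "0" = "0km", "1" = "1-2km", "2" = "3-4km", "3" = "3-5km".

So far I only seem to be able to change the names; I cannot keep the data associated with them in the original table.

Ideally the result would look like the table below, but with the periods and distances displayed correctly. The periods are in chronological order, although only a few of them are shown here.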

dput(bomxtab3)
structure(list(Period = structure(c(10L, 5L, 6L, 1L, NA, 8L, 
7L, 3L, 9L, 2L, 4L, 10L, 5L, 6L, 1L, NA, 8L, 7L, 3L, 9L, 2L, 
4L, 10L, 5L, 6L, 1L, NA, 8L, 7L, 3L, 9L, 2L, 4L, 10L, 5L, 6L, 
1L, NA, 8L, 7L, 3L, 9L, 2L, 4L), .Label = c("EF", "MF", "LF", 
"TF", "CL", "EC", "LC", "ET", "LT", "AZ"), class = "factor"), 
    Distance = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
    4L, 4L, 4L, 4L, 4L), .Label = c("0", "1", "2", "3"), class = "factor"), 
    Population = c(242391L, 1774L, 1980L, 315L, 0L, 9898L, 1430L, 
    5355L, 5010L, 903L, 2420L, 83725L, 7953L, 3320L, 175L, 200L, 
    13514L, 2370L, 8943L, 15018L, 4909L, 17107L, 55994L, 5546L, 
    1105L, 0L, 0L, 16110L, 1405L, 4105L, 5905L, 335L, 21563L, 
    19636L, 4815L, 1670L, 0L, 0L, 12811L, 525L, 3950L, 8563L, 
    3845L, 8562L), `bomxtab2$Population/x` = c(0.603343903859653, 
    0.0883114297092792, 0.245201238390093, 0.642857142857143, 
    0, 0.189134962643074, 0.24956369982548, 0.239565159039055, 
    0.145234230055659, 0.0903722978382706, 0.0487392250060421, 
    0.208402821683352, 0.395908004778973, 0.411145510835913, 
    0.357142857142857, 1, 0.258230944146141, 0.413612565445026, 
    0.400080526103879, 0.435354823747681, 0.491293034427542, 
    0.344537984371224, 0.139376621049121, 0.276085225009956, 
    0.136842105263158, 0, 0, 0.307836355645577, 0.245200698080279, 
    0.183644253567754, 0.17117926716141, 0.0335268214571657, 
    0.434282606944333, 0.0488766534078746, 0.239695340501792, 
    0.206811145510836, 0, 0, 0.244797737565207, 0.0916230366492147, 
    0.176710061289312, 0.24823167903525, 0.384807846277022, 0.172440183678402
    )), row.names = c(NA, -44L), class = "data.frame")

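1 answer:

Answer 0 (score: 1)

If I understand correctly, you want to recode the variables you have. There is already plenty of material on how to do this. I was not completely sure I had the grouping of the period variable right, which is why I create a new variable instead of overwriting the original. I use the tidyverse package.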

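library(tidyverse)

# bomxtab3 is the data frame from the question
df <- bomxtab3 %>% mutate(
  # Collapse related periods into a new column; all other periods keep their label
  Period1 = case_when(
    Period %in% c("CL", "EC", "LC") ~ "CL",
    Period %in% c("MF", "LF") ~ "F",
    TRUE ~ as.character(Period)),
  # Relabel the distance codes; the original Distance column is left untouched
  Distance1 = recode(Distance,
                     `0` = "0km",
                     `1` = "1-2km",
                     `2` = "2-3km",
                     `3` = "3-5km")
  )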

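%>% creates a pipe (dplyr re-exports it from magrittr). case_when works like a chain of if statements: if X is true, then do Y. Here I check whether the value in the Period column is in the given vector, and if it is, I recode it; every other value falls through to the TRUE branch and keeps its original label. For a random sample of rows, the result is: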

   Period Distance Population bomxtab2$Population/x Period1 Distance1
1      ET        0       9898            0.18913496      ET       0km
2      EF        3          0            0.00000000      EF     3-5km
3      ET        2      16110            0.30783636      ET     2-3km
4      MF        0        903            0.09037230       F       0km
5      ET        3      12811            0.24479774      ET     3-5km
6      MF        2        335            0.03352682       F     2-3km
7      LC        2       1405            0.24520070      CL     2-3km
8      TF        3       8562            0.17244018      TF     3-5km
9      CL        0       1774            0.08831143      CL       0km
10     ET        1      13514            0.25823094      ET     1-2km
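
The recoding above only relabels rows, so merged periods such as "CL" still occupy several rows per distance. If the goal is one row per merged period and distance, one possible follow-up is to sum the population within each group. The sketch below is not part of the original answer: it uses a small made-up data frame named toy (hypothetical values, not the question's data), assumes dplyr 1.0.0 or later for the .groups argument, and the level order passed to factor() is only a placeholder for the real chronological order.

```r
library(dplyr)

# Made-up data mirroring the question's columns (values are illustrative only)
toy <- data.frame(
  Period     = factor(c("CL", "EC", "LC", "MF", "LF", "ET")),
  Distance   = factor(c("0", "0", "0", "1", "1", "1")),
  Population = c(10L, 20L, 30L, 5L, 7L, 100L)
)

combined <- toy %>%
  mutate(
    # Same collapsing rule as in the answer
    Period1 = case_when(
      Period %in% c("CL", "EC", "LC") ~ "CL",
      Period %in% c("MF", "LF")       ~ "F",
      TRUE                            ~ as.character(Period)
    ),
    # Turn the labels back into a factor with an explicit level order
    # (placeholder order; replace with the actual chronology)
    Period1 = factor(Period1, levels = c("F", "CL", "ET"))
  ) %>%
  # One row per merged period and distance, with populations added up
  group_by(Period1, Distance) %>%
  summarise(Population = sum(Population), .groups = "drop")

print(combined)
```

Any derived column, such as the proportion in bomxtab2$Population/x, would need to be recomputed from the summed populations rather than added up.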