更改分层数据格式以计算平均值

时间:2014-05-27 23:49:31

标签: r

我有以下数据:

   ID  city gender total  0-4  5-9  10-14
1  A1 city1      M   120   30   50     40
2  A1 city1      F   100   20   45     35
3  B1 city1      F   130   35   50     45
4  B1 city1      M   150   30   60     60
5  C1 city2      M   140   40   50     50
6  C1 city2      F   135   35   45     55
7  D1 city2      M   145   40   55     50
8  D1 city2      F   165   65   35     65
9  E1 city2      M   155   50   50     55
10 E1 city2      F   160   50   45     65

这里0-4岁,5-9岁,10-14岁是年龄组。

如何以下列形式轻松更改此数据:

  ID  city total male female meanage   male%
1 A1 city1   220  120    100    7.57      54
2 B1 city1   280  150    130     8.5      53
3 C1 city2   ---  ---    ---    ----    ----
4 D1 city2    --  ---    ---     ---   -----
5 E1 city2    -- ----   ----   ----- -------

感谢。

2 个答案:

答案 0 :(得分:0)

你真的想做很多复杂的转换。我看不到一步一步完成这一切的简单方法。首先,这里的样本数据来自

dd<-data.frame(
    ID = paste0(rep(LETTERS[1:5],each=2),"1"), 
    city = rep(c("city1", "city2"), c(4,6)), 
    gender = c("M", "F", "F", "M", "M", "F", "M", "F", "M", "F"), 
    total = c(120, 100, 130, 150, 140, 135, 145, 165, 155, 160), 
    X0.4 = c(30, 20, 35, 30, 40, 35, 40, 65, 50, 50), 
    X5.9 = c(50, 45, 50, 60, 50, 45, 55, 35, 50, 45), 
    X10.14 = c(40, 35, 45, 60, 50, 55, 50, 65, 55, 65)
)

现在我们需要在开始重塑数据之前计算总加权年龄

tt<-transform(dd, 
    wage=2*X0.4 + 7*X5.9 + 12*X10.14,
    X0.4=NULL,
    X5.9=NULL,
    X10.14=NULL
)

现在我们将重塑数据。它还不够宽,而且#34;因此,我们将融化它,并扩大它,以便我们为每个性别的总和和工资列。

library(reshape2)
mm<-melt(tt, id.vars=c("ID","city","gender"))
cc<-dcast(mm, ID+city~gender+variable)

现在我们拥有每行所需的所有数据,因此我们进行最后一组转换以获取我们想要的列

transform(cc, total=F_total+M_total,
    male=M_total,
    female=F_total,
    meanage = (F_wage+M_wage)/(M_total+F_total),
    maleperc = M_total/(M_total+F_total) * 100,
    F_total=NULL, M_total=NULL,
    F_wage=NULL, M_wage=NULL
)

哪个产生

  ID  city total male female  meanage maleperc
1 A1 city1   220  120    100 7.568182 54.54545
2 B1 city1   280  150    130 7.714286 53.57143
3 C1 city2   275  140    135 7.545455 50.90909
4 D1 city2   310  145    165 7.161290 46.77419
5 E1 city2   315  155    160 7.317460 49.20635

所以我不确定我同意你的平均年龄B1。如果您正在使用其他公式计算该值,请随时更新代码。

答案 1 :(得分:0)

这或多或少会对数据的前几行产生影响。 (我认为你对B1的meanage的计算是关闭的,但是根据年龄范围计算平均值对我来说似乎并不合适。)

library('dplyr')
library('reshape2')

yourdata <- data.frame(ID = c("A1", "A1", "B1", "B1", "C1", "C1"),
                       city = c(rep("city1", 4), "city2", "city2"), 
                       gender = c("M", "F", "F", "M", "M", "F"), 
                       total = c(120, 100, 130, 150, 140, 135), 
                       X0_4 = c(30, 20, 35, 30, 40, 35), 
                       X5_9 = c(50, 45, 50, 60, 50, 45), 
                       X10_14 = c(40, 35, 45, 60, 50, 55))

# total counts of each age group for each ID
age_counts <- yourdata %>%
  group_by(ID) %>%
  summarize(X0_4 = sum(X0_4), X5_9 = sum(X5_9), X10_14 = sum(X10_14))

# use dcast to get 'male' and 'female' columns
data_wide <- dcast(yourdata, ID + city ~ gender, value.var = 'total')

# merge back with age counts
data_full  <- merge(data_wide, age_counts, by = 'ID')

# final table
data_full %>%
  group_by(ID) %>%
  mutate(total = sum(F, M), 
         meanage = ((X0_4 * ((0 + 4) / 2) + X5_9 * ((5 + 9) / 2) + X10_14 * 
                      ((10 + 14) / 2)) / total),
         male_percent = ((M * 100) %/% total)) %>%
  select(ID, city, total, M, F, meanage, male_percent)