Creating multiple variables from a function on a subset via sapply: results aren't "real"?

Time: 2016-03-10 18:46:59

Tags: r function subset sapply

Introduction

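I have longitudinal data in wide format that measures each company's total sales per year. From this, I want to create a new set of variables, market share, for each company and each year in the data. The full data set is too large to do this the long, clumsy way, so I tried to do it with sapply by running a function over a subset (i.e. the columns that hold each year's sales data).

However, the result does not seem to produce "real" variables: they show up when the data frame is printed (head()) but not in names(). Is there something wrong with my code?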

# SAMPLE DATA
agyrw <- structure(list(company = c(28, 128, 22, 72, 62, 65, 132, 89, 46, 105), value.1993 = c(79272, 35850, 2124, 32, 0, 0, 0, 26359, 0, 0), value.1994 = c(103974, 10219, 31432, 0, 0, 0, 3997, 469, 0, 0)), .Names = c("company", "value.1993", "value.1994"), row.names = c(9L, 42L, 1L, 30L, 22L, 28L, 51L, 34L, 20L, 40L), class = "data.frame")

agyrw2 <- agyrw     # FOR A LATER COMPARISON

agyrw
#      company value.1993 value.1994
#         28      79272     103974
#        128      35850      10219
#         22       2124      31432
#         72         32          0
#         62          0          0
#         65          0          0
#        132          0       3997
#         89      26359        469
#         46          0          0
#        105          0          0

The Long, Clumsy Way

# SUM TOTAL VALUE BY YEAR
total.1993 <- sum(agyrw$value.1993)
total.1994 <- sum(agyrw$value.1994)

# CALCULATE THE MARKET SHARE FOR EACH IMPORTER, BY YEAR
agyrw$share.1993 <- agyrw$value.1993 / total.1993
agyrw$share.1994 <- agyrw$value.1994 / total.1994

# FORMAT THE MARKET SHARE VARIABLE TO ONLY FOUR DECIMAL PLACES
agyrw$share.1993 <- format(round(agyrw$share.1993, 4), nsmall = 4)
agyrw$share.1994 <- format(round(agyrw$share.1994, 4), nsmall = 4)

# RECONVERT THE MARKET SHARE VARIABLE BACK INTO NUMERIC
agyrw$share.1993 <- as.numeric(agyrw$share.1993)
agyrw$share.1994 <- as.numeric(agyrw$share.1994)

# VIEW
agyrw
#       company value.1993 value.1994 share.1993 share.1994
#          28      79272     103974     0.5519     0.6927
#         128      35850      10219     0.2496     0.0681
#          22       2124      31432     0.0148     0.2094
#          72         32          0     0.0002     0.0000
#          62          0          0     0.0000     0.0000
#          65          0          0     0.0000     0.0000
#         132          0       3997     0.0000     0.0266
#          89      26359        469     0.1835     0.0031
#          46          0          0     0.0000     0.0000
#         105          0          0     0.0000     0.0000

Parsimonious Attempt

agyrw2$share <- sapply(agyrw2[,2:3], function(x) {
    total <- sum(x)
    share <- as.numeric(format(round(x/total, 4), nsmall = 4))
    return(share)
    }
       )

# VIEW
agyrw2
#      company value.1993 value.1994 share.value.1993 share.value.1994
#          28      79272     103974           0.5519           0.6927
#         128      35850      10219           0.2496           0.0681
#          22       2124      31432           0.0148           0.2094
#          72         32          0           0.0002           0.0000
#          62          0          0           0.0000           0.0000
#          65          0          0           0.0000           0.0000
#         132          0       3997           0.0000           0.0266
#          89      26359        469           0.1835           0.0031
#          46          0          0           0.0000           0.0000
#         105          0          0           0.0000           0.0000

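Problem

On first inspection, everything looks fine. The result of using sapply with the function on agyrw2 is identical to the agyrw result created by the clumsy code (apart from slightly different column names).

But when I try to call any of the newly created variables in agyrw2, they do not seem to exist, even though they show up when printed. For example, calling the column names yields only a single share column: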

agyrw2$share

How can I rewrite the function so that it actually produces new columns in the data frame?

2 Answers:

Answer 0 (score: 1)

How about:

agyrw2 <- cbind(agyrw2,do.call(cbind, lapply(agyrw2[,2:3], function(x) {
    total <- sum(x)
    share <- as.numeric(format(round(x/total, 4), nsmall = 4))
    return(share)
    })))

Or simply:

agyrw2$share.1993 <- as.numeric(format(round(agyrw2$value.1993 / sum(agyrw2$value.1993), 4), nsmall = 4))
agyrw2$share.1994 <- as.numeric(format(round(agyrw2$value.1994 / sum(agyrw2$value.1994), 4), nsmall = 4))
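For many year columns, the same idea can be written once in base R by assigning a list of results to a vector of new column names. This is a minimal sketch, not the answerer's code: it uses a three-row stand-in for the question's data, the names df, value_cols and share_cols are illustrative, and it relies on round(x, 4) giving the same numeric result as the format()/as.numeric() round trip. It also sidesteps a likely side effect of the cbind() version above, which appears to reuse the names value.1993 and value.1994 for the new columns.

```r
# Stand-in for the question's data (first three rows of agyrw)
df <- data.frame(company    = c(28, 128, 22),
                 value.1993 = c(79272, 35850, 2124),
                 value.1994 = c(103974, 10219, 31432))

# Pick up every yearly sales column and derive a matching share name
value_cols <- grep("^value\\.", names(df), value = TRUE)
share_cols <- sub("^value\\.", "share.", value_cols)

# lapply() returns a list of vectors; assigning it to several names
# at once creates one real column per list element
df[share_cols] <- lapply(df[value_cols], function(x) round(x / sum(x), 4))

names(df)
```

Because lapply() returns a list rather than a matrix, each share ends up as its own column and shows up in names().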

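Answer 1 (score: 1)

The problem is that share is actually a single 2-column matrix, not 2 separate columns. The matrix's columns are named value.1993 and value.1994, but it is still a single object.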
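The matrix column can be inspected directly. The sketch below uses a three-row stand-in for the question's data (df and flat are illustrative names) and also shows one possible way to flatten the matrix into ordinary columns with do.call(data.frame, ...).

```r
# Stand-in for the question's data (first three rows of agyrw)
df <- data.frame(company    = c(28, 128, 22),
                 value.1993 = c(79272, 35850, 2124),
                 value.1994 = c(103974, 10219, 31432))

# sapply() simplifies the per-column results to a matrix;
# assigning it stores ONE matrix-valued column called "share"
df$share <- sapply(df[, 2:3], function(x) round(x / sum(x), 4))

names(df)            # "company" "value.1993" "value.1994" "share"
is.matrix(df$share)  # TRUE
colnames(df$share)   # "value.1993" "value.1994"

# The printed share.value.1993 header is only how print() displays
# a matrix column; the values are reached by indexing the matrix
df$share[, "value.1993"]

# Flatten: data.frame() expands a matrix argument into separate columns
flat <- do.call(data.frame, df)
names(flat)
```

After flattening, share.value.1993 and share.value.1994 are ordinary columns of flat.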

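You can do this kind of thing in base R, but for data wrangling and transformation it is better to use one of the packages designed specifically for it.

In dplyr: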

library(dplyr)
agyrw %>%
    mutate(share93=value.1993/sum(value.1993), share94=value.1994/sum(value.1994))

If you have multiple columns:

vars <- names(agyrw[-1])
names(vars) <- paste0(vars, ".share")
agyrw %>% mutate_each_(funs(./sum(.)), vars)
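mutate_each_() and funs() have since been deprecated in dplyr. A rough equivalent, assuming dplyr 1.0.0 or later is installed, uses across(); the three-row df and the name out are illustrative stand-ins, not part of the original answer.

```r
library(dplyr)

# Stand-in for the question's data (first three rows of agyrw)
df <- data.frame(company    = c(28, 128, 22),
                 value.1993 = c(79272, 35850, 2124),
                 value.1994 = c(103974, 10219, 31432))

# Apply x / sum(x) to every column whose name starts with "value."
# and name each result <original name>.share
out <- df %>%
  mutate(across(starts_with("value."),
                ~ .x / sum(.x),
                .names = "{.col}.share"))

names(out)
```

The .names template keeps the same naming pattern as the paste0(vars, ".share") approach above.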

In sqldf:

library(sqldf)
names(agyrw) <- c("company", "value1993", "value1994")  # use syntactically valid SQL names
sqldf("select company, value1993, value1994,
              value1993/sum1993 as share1993,
              value1994/sum1994 as share1994
       from (agyrw join (
             select sum(value1993) as sum1993, sum(value1994) as sum1994 from agyrw))")