Exporting rows of descriptive statistics from R to Excel

Time: 2016-02-16 14:56:44

Tags: r excel

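I have a large database of more than 85,000 values, covering more than 100 different companies and more than 100 variables. My goal is to determine the descriptive statistics (mean, standard deviation, minimum, and number of values) for several of those variables.

Below is the information for one given company, which I will call Company F.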

Attendance   Number of representatives   Number of Presenters     Company Audience  
29           2                            30                      2
20           3                            30                      4   
30           10                           20                      5
40           20                           10                      5
10           30                           13                      5

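What I want is for R to compute the descriptive statistics [mean, standard deviation, minimum, and maximum] for each of these specific columns and export the output to Excel in the following layout: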

Company F  Average Number of Attendance Standard Deviation of Number of Attendance Min  Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives   Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 

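Because that is a very long row, I will summarize it: I am trying to find the descriptive statistics [mean, standard deviation, minimum, maximum, and n] for each column, all on one row that corresponds to Company F.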
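As a minimal base R sketch of that target row, the five statistics can be computed per column and flattened into a single named row. The shortened column names (`Attendance`, `Representatives`, `Presenters`, `Audience`) and the helper `five_stats` are assumptions made for illustration; they are not part of the original data.

```r
# Company F values copied from the table above; column names are shortened (assumption)
company_f <- data.frame(
  Attendance      = c(29, 20, 30, 40, 10),
  Representatives = c(2, 3, 10, 20, 30),
  Presenters      = c(30, 30, 20, 10, 13),
  Audience        = c(2, 4, 5, 5, 5)
)

# Hypothetical helper: the five statistics for one numeric vector
five_stats <- function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x), n = length(x))
}

# 5 x 4 matrix: one row per statistic, one column per variable
stats <- sapply(company_f, five_stats)

# Flatten column by column and label each value as <variable>_<statistic>
flat <- as.vector(stats)
names(flat) <- paste(rep(colnames(stats), each = nrow(stats)),
                     rownames(stats), sep = "_")

# A single-row data frame for Company F
row_f <- data.frame(company = "Company F", t(flat))
print(row_f)
```

Rows built this way for other companies have the same columns, so they can be stacked with `rbind` before exporting.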

How I have tried to solve this:

I used a descriptive statistics function in R to get the values from the data frame. To do this, I used the psych package:

 library(psych)
 describe(CompanyF$Attendance)
 describe(CompanyF$NumberofRepresentatives)
 describe(CompanyF$Number_of_Presenters)
 describe(CompanyF$`Company Audience`)

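With the package I was able to get the statistics, then go into Excel and build the row by hand, typing in the values I received and leaving out anything the psych package reported that I did not need. Here is an example of the kind of output I get from the psych package: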

vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 559 2.02 2.21      1    1.75 1.48   0   9     9 0.78    -0.65 0.09

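This process is very time-consuming and error-prone. After finishing Company F, I create a new row in Excel beneath the Company F row, this time for another company, say Company G, and repeat the process of finding the descriptive statistics [mean, standard deviation, min, max, and n] for each variable of interest (attendance, number of representatives, number of presenters, and company audience).

I have looked at various solutions, including the Stack Overflow post "Export data from R into Excel", but I could not find an explanation of how to write the information from R to Excel row by row, or how to specify that each row should contain the descriptive statistics I listed above.

Ideally, I would get the following output in Excel: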

Company F  Average Number of Attendance Standard Deviation of Number of Attendance Min  Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives   Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 
Company G  Average Number of Attendance Standard Deviation of Number of Attendance Min  Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives   Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 
Company H  Average Number of Attendance Standard Deviation of Number of Attendance Min  Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives   Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 

And so on.

A raw subset of my data is below:

structure(list(sn = structure(c(2L, 2L, 3L, 5L, 2L, 7L, 1L, 9L, 
1L, 9L, NA, 9L, 1L, 26L, 11L, 9L, 7L, NA, NA, 7L, 17L, 9L, NA, 
21L, 7L, 17L, 7L, 7L, 16L, 7L, 7L, 7L, 7L, 26L, 7L, 6L, 26L, 
22L, NA, NA, 11L, 23L, 23L, 26L, NA, 7L, 23L, 1L, NA, 1L, 7L, 
11L, 12L, 13L, 9L, NA, 15L, NA, 20L, 15L, NA, 17L, 5L, NA, 22L, 
15L, NA, NA, 5L, 8L, 32L, 29L, 23L, 33L, 1L, 23L, 14L, 6L, 7L, 
15L), .Label = c("Broome Street", "Company A", "Company B", "Company BC",
"Company C", "Company CC", "Company D Clinton", "Company DD", 
"Company E", "Company ED BroadCompany", "Company G", "Company H     
BroadCompany", 
"Company I BroadCompany", "Company I Studio", "Company J", "Company K", 
"Company L", "Company M", "Company M BroadCompany", "Company M HS    
 BroadCompany", 
"Company MCC BroadCompany", "Company N", "Company P", "Company Q", 
"Company Q Company N", "Company Q Company ZZ", "Company R - Company ZZ", 
"Company SLab", "Company Z", "Company ZE", "Company ZED", "Company ZEQ", 
"Company ZZ", "Company ZZQ", "Company ZZQ Company N"), class = "factor"), 
earn_tot = c(21.85, 20.8, NA, 8.16, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 7.16, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 43.32, NA, 30.48, NA, NA, 34.9, NA, NA, NA, NA, NA, 25.82, 
40.75, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, 
30, NA, NA, NA, NA, NA, NA, 39.1, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 52.29, 44.32, NA, 7, 38.32, 0, NA, NA, 8.25, 
NA, NA), earn_and_current_tot = c(29.43, 20.8, NA, 8.16, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 7.16, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 49.9, NA, 37.56, NA, NA, 41.98, 
NA, NA, NA, NA, NA, 37.32, 49, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 0, NA, NA, NA, 37, NA, NA, NA, NA, NA, NA, 47.68, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 57.29, 48.48, NA, 
7, 45.9, 0, NA, NA, 15.75, NA, NA), pass_99 = c(0L, 0L, NA, 
NA, NA, NA, 1L, NA, NA, NA, NA, 5L, NA, 0L, NA, 5L, NA, NA, 
NA, 0L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 0L, NA, NA, NA, NA, 5L, NA, NA, NA, NA, 4L, 0L, 
NA, NA, NA, 4L, 4L, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, 
NA, 1L, NA, NA, NA, NA, 1L, NA, NA, 0L, 4L, 0L, NA, NA, 0L, 
NA, NA), pass_65 = c(0L, 0L, 5L, 0L, 6L, NA, 0L, 5L, NA, 
5L, NA, 6L, NA, 0L, 5L, 2L, NA, NA, NA, 0L, 5L, 5L, NA, NA, 
NA, 0L, NA, 1L, 4L, 7L, 5L, 5L, 7L, 0L, 5L, NA, 0L, 1L, NA, 
NA, NA, 2L, 0L, 6L, NA, 8L, 2L, 0L, NA, 4L, 0L, 1L, 3L, NA, 
NA, NA, NA, NA, 4L, 0L, NA, 5L, 7L, NA, 0L, NA, NA, NA, 5L, 
0L, 5L, 4L, 0L, 2L, 0L, 0L, 7L, 0L, NA, 5L)), .Names = c("sn", 
"earn_tot", "earn_and_current_tot", "pass_99", "pass_65"), row.names = c(NA, 
80L), class = "data.frame")

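Four columns of the subset matter most: "earn_tot", "earn_and_current_tot", "pass_99", and "pass_65". The company names listed here are anonymized. I am working with roughly 100 companies. The company names are in the column named "sn". The subsetted data set as a whole is called Subset.MergedEx.SO.

I apologize for not providing a good reproducible example earlier. Thank you for your patience. I have been reading up on how to build one and used the following code:     dput((head(Subset.MergedEx.SO,80)))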

2 Answers:

Answer 0 (score: 1)

What you can do is melt your data into long format and then cast it back into wide format with multiple aggregation functions:

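library(data.table)
# melt to long format (one row per company/variable/value),
# then cast back to wide with one column per statistic and variable
dat.new <- dcast(melt(dat, id.vars = "company"),
                 company ~ variable, 
                 fun.aggregate = list(mean, sd), 
                 value.var = "value")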

This gives:

> dat.new
   company value_mean_attendance value_mean_presenters value_mean_audience value_sd_attendance value_sd_presenters value_sd_audience
1:       A                   8.0                  24.8                60.6            1.870829            4.207137          7.668116
2:       B                   8.2                  23.8                64.2            2.489980            2.387467          2.049390
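The call above computes only the mean and standard deviation, whereas the question also asks for min, max, and n, and its data contains missing values. As a hedged base R sketch of the same wide layout (not the answer's data.table method), `aggregate` can apply a function that returns all five statistics while ignoring NAs. The toy data and the helper name `five_stats` are assumptions made for illustration.

```r
# Toy data with missing values (illustrative only, not the answer's dat)
dat <- data.frame(
  company    = c("A", "A", "A", "B", "B", "B"),
  attendance = c(5, 7, NA, 8, 10, 9),
  presenters = c(20, 25, 30, 22, NA, 24)
)

# Hypothetical helper: five statistics computed on the non-missing values
five_stats <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(c(mean = NA, sd = NA, min = NA, max = NA, n = 0))
  }
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x), n = length(x))
}

# The non-formula interface keeps rows with NAs, so five_stats decides how to treat them
agg <- aggregate(dat[c("attendance", "presenters")],
                 by = list(company = dat$company),
                 FUN = five_stats)

# aggregate() stores the five values as matrix columns; flatten them into
# ordinary columns named <variable>.<statistic>
wide <- do.call(data.frame, agg)
print(wide)
```

Each company ends up on one row, which matches the row-per-company layout requested in the question.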

Now you can write this to an Excel file using, for example, the WriteXLS package:

library(WriteXLS)
WriteXLS("dat.new","companies.xls")
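If installing an Excel-writing package is not an option, a CSV file is a possible fallback, since Excel can open CSV files directly. This is a minimal sketch using only base R; the small `summary_tbl` table reuses a few rounded values from the output above, and the file location in `tempdir()` is an assumption made for illustration.

```r
# A small summary table standing in for dat.new (values rounded from the output above)
summary_tbl <- data.frame(
  company         = c("A", "B"),
  attendance_mean = c(8.0, 8.2),
  attendance_sd   = c(1.87, 2.49)
)

# Write one row per company; row.names = FALSE avoids an extra index column
out_file <- file.path(tempdir(), "companies.csv")
write.csv(summary_tbl, out_file, row.names = FALSE)

# Read the file back to confirm the layout that Excel would see
check <- read.csv(out_file)
print(check)
```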

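Because you want to compute many statistics for each company, you may want to consider writing each company's summary statistics to a separate sheet in the Excel file.

Again, you melt the data into long format and then summarize per company and per variable with lapply(.SD, function(x) list(average = mean(x), sdev = sd(x)))$value. Split the resulting data.table by company into a list of data.tables. Finally, write that list to an Excel file: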

dat.new <- melt(dat, id="company")[, lapply(.SD, function(x) list(average = mean(x), sdev = sd(x)))$value, 
                                    .(company,variable)]

company.list <- split(dat.new, dat.new$company)

WriteXLS(company.list,"companies.xls")

Now you have an Excel file with a separate tab for each company.
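To make the splitting step concrete, here is a small base R sketch of what `split` returns: a named list with one table per company. The `long` data frame is a hand-typed stand-in for the summarized table, using values from the output shown earlier, so it is illustrative only.

```r
# Long-format summary: one row per company and variable (illustrative values)
long <- data.frame(
  company  = c("A", "A", "B", "B"),
  variable = c("attendance", "presenters", "attendance", "presenters"),
  average  = c(8.0, 24.8, 8.2, 23.8),
  sdev     = c(1.870829, 4.207137, 2.489980, 2.387467)
)

# split() returns a list with one element per company, named after the company
company.list <- split(long, long$company)

print(names(company.list))
print(company.list$A)
```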

Data used:

set.seed(21)
dat <- data.table(company = rep(c("A","B"), each = 5),
                  attendance = sample(5:10,10,TRUE),
                  presenters = sample(20:30,10,TRUE),
                  audience = sample(50:70,10,TRUE))

Answer 1 (score: 1)

This may not be the best solution, but it uses only base R and the psych package.

Here is the data:

df <- data.frame(company = rep(c("A","B", "C","D"), each = 5),
              attendance = sample(5:10,20,TRUE),
              representatives = sample(2:30,20,TRUE),
              presenters = sample(20:30,20,TRUE),
              audience = sample(50:70,20,TRUE))

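I wrote a function to get the values you need. I am assuming you have only 5 kinds of information: company name, attendance, representatives, presenters, and audience.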

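get.values <- function(x) {
  require(psych)
  # Descriptives for columns 2-5, computed separately for each company in column 1
  info <- describeBy(x[, 2:5], group = x[, 1])
  # One list element per company; use x (not the global df) so the function is self-contained
  n.companies <- length(info)
  n <- list()
  mean <- list()
  sd <- list()
  min <- list()
  max <- list()
  for (i in 1:n.companies) {
    # describeBy columns: 2 = n, 3 = mean, 4 = sd, 8 = min, 9 = max
    n[[i]] <- info[[i]][, 2]
    mean[[i]] <- info[[i]][, 3]
    sd[[i]] <- info[[i]][, 4]
    min[[i]] <- info[[i]][, 8]
    max[[i]] <- info[[i]][, 9]
  }
  # One row per company: all means, then all sds, mins, maxes, and ns
  l <- Map(c, mean, sd, min, max, n)
  valuedf <- do.call(rbind, l)
  return(valuedf)
}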

I also wrote a function to generate the column names you want; you can name them whatever you like:

get.names <- function(x) {
  require(psych)
  names <- rownames(describe(x[, 2:5]))
  avg <- character()
  sd <- character()
  min <- character()
  max <- character()
  total <- character()
  for (i in seq_along(names)) {
    avg[i] <- paste("average number of", names[i])
    sd[i] <- paste("standard deviation of", names[i])
    min[i] <- paste("min number of", names[i])
    max[i] <- paste("max number of", names[i])
    total[i] <- paste("total number of", names[i])
  }
  # Same order as the values: all averages, then sds, mins, maxes, and totals
  cnames <- c(avg, sd, min, max, total)
  return(cnames)
}

Combine the values and names into a new data frame:

output<-get.values(df)
col.names<-get.names(df)
colnames(output)<-col.names
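# factor() makes this work even when the company column is character,
# which is the data.frame default from R 4.0 onward
rownames(output)<-levels(factor(df[,1]))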
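The layout built here puts the statistics in blocks: all means first, then all standard deviations, mins, maxes, and counts, with the column names in the same order. As a rough base R sketch of that same layout without psych (an alternative for illustration, not the answer's method), `split` and `sapply` can produce the rows, and a vectorized `paste` can produce the names. The toy data and the helper `row_for` are assumptions.

```r
# Toy data (illustrative only)
df <- data.frame(
  company    = rep(c("A", "B"), each = 3),
  attendance = c(5, 6, 7, 8, 9, 10),
  presenters = c(20, 22, 24, 30, 30, 30)
)
vars <- c("attendance", "presenters")

# One data frame per company, keeping only the measured columns
groups <- split(df[vars], df$company)

# Hypothetical helper: statistic-major order, matching the naming order below
row_for <- function(g) {
  c(sapply(g, mean), sapply(g, sd), sapply(g, min),
    sapply(g, max), sapply(g, length))
}

# One row per company; row names come from the list names
output <- do.call(rbind, lapply(groups, row_for))

# paste() is vectorized, so no loop is needed to build the labels
colnames(output) <- c(paste("average number of", vars),
                      paste("standard deviation of", vars),
                      paste("min number of", vars),
                      paste("max number of", vars),
                      paste("total number of", vars))
print(output)
```

Keeping the values and the names in the same block order is what makes the labels line up with the right numbers.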

Export to Excel:

library(xlsx)
write.xlsx(output, "descriptives.xlsx")