Normalizing data in R

Time: 2013-05-20 10:25:50

Tags: r

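Hi, I have the following data.frame (appended below). I would like to add an extra column of normalized counts, N.norm = N/sum(N). I previously had a data.frame without the Date column and was able to do this with: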

oo[, N.norm := N/sum(N), by=Operator]

I tried adding Date to the by argument:

oo[, N.norm := N/sum(N), by=Operator,Date]

but received the error message:

Error in `[.data.frame`(oo, , `:=`(N.norm, N/sum(N)), by = Operator, Date) : 
  unused argument(s) (by = Operator)

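For example, for operator "A" in "Jan 2013" I have a count N for each ROI_Score in c("Good", "OK", "Poor", "Crap"). I want to sum N over that combination (A and Jan 2013) and divide each count N by sum(N).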

On a side note, can anyone point me to a decent introduction to manipulating data.frames in R?

structure(list(Operator = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("A", 
"D", "J", "L", "M"), class = "factor"), ROI_Score = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 
4L, 4L, 4L), .Label = c("Crap", "Good", "OK", "Poor"), class = "factor"), 
    Date = c("Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013", "Apr 2013", "Feb 2013", "Jan 2013", "Mar 2013", 
    "May 2013"), N = c(0, 0, 0, 0, 0, 1, 2, 15, 1, 5, 3, 2, 3, 
    1, 0, 3, 0, 5, 5, 1, 0, 0, 0, 1, 0, 14, 17, 16, 8, 7, 5, 
    10, 6, 1, 5, 24, 27, 31, 16, 15, 0, 0, 0, 0, 0, 26, 24, 20, 
    11, 18, 3, 4, 17, 3, 2, 20, 36, 12, 21, 9, 0, 0, 0, 0, 0, 
    3, 12, 5, 12, 4, 0, 0, 3, 4, 0, 29, 37, 41, 25, 10, 0, 0, 
    0, 0, 0, 9, 9, 15, 17, 3, 6, 4, 5, 4, 1, 14, 13, 9, 15, 9
    )), .Names = c("Operator", "ROI_Score", "Date", "N"), row.names = c(NA, 
100L), class = "data.frame")

I am not sure whether the data is in data.frame or data.table format. Here is my code, adapted from the solution given by Arun (reshape/remould data frame to create normalized bar chart and pie chart):

df <- data.frame(read.csv("/misc/jaguar_data/report/system/db_fs/roi_scores.csv"))
#Get date into nice structure for faceting
df$Date = strftime(strptime(df$Date,f="%d/%m/%Y"), "%b %Y")
dt <- data.table(df)
ops <- as.character(unique(dt$Operator))
scr <- as.character(unique(dt$ROI_Score))
dts <- unique(dt$Date)

oo <- setkey(dt[, .N, by="Operator,ROI_Score,Date"], Operator,
ROI_Score,Date)[CJ(ops, scr,dts)][is.na(N), N:= 0L]

oo[, N.norm := N/sum(N), by=Operator]

2 Answers:

Answer 0 (score: 5)

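Your code is (almost) perfectly fine. Two minor issues.

1: You are using data.table syntax, so oo should be a data.table, not a data.frame. Simply use:

library(data.table)
oo <- data.table(oo)

2: When using multiple columns in by, make sure to wrap the columns in list(..) or pass them as one comma-separated string. Example:

oo[, N.norm := N/sum(N), by=list(Operator,Date)]
# - or - #
oo[, N.norm := N/sum(N), by="Operator,Date"]

Edit: If you want to divide by the total of each Operator-Date group, then your code should be as above. If instead you want to divide by the sum over the whole data, then use:

oo[, N.norm := N/sum(oo$N), by=list(Operator,Date)]

Fixing those two things and using everything else exactly as you have it gives (for the per-group version):

     Operator ROI_Score     Date  N    N.norm
  1:        A      Crap Apr 2013  0 0.0000000
  2:        A      Crap Feb 2013  0 0.0000000
  3:        A      Crap Jan 2013  0 0.0000000
  4:        A      Crap Mar 2013  0 0.0000000
  5:        A      Crap May 2013  0 0.0000000
 ---                                         
 96:        M      Poor Apr 2013 14 0.4827586
 97:        M      Poor Feb 2013 13 0.5000000
 98:        M      Poor Jan 2013  9 0.3103448
 99:        M      Poor Mar 2013 15 0.4166667
100:        M      Poor May 2013  9 0.6923077

Edit 2:

Just a note: in general, if you use expressions inside the [ ] brackets, especially the assign-by-reference operator :=, then your object should be a data.table.

If you see an error such as

Error in `[.data.frame`( _<your object name>_, ...

then it is probably because either (a) your object is not a data.table or (b) you forgot to load the data.table package.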
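
A minimal sketch of the grouped normalization described in this answer, assuming the data.table package is installed. The object name small_dt and the column N.share are hypothetical; the eight rows are copied from the posted data (operator A, Jan 2013 and Feb 2013).

```r
library(data.table)

# Eight rows taken from the posted data: operator A, two months.
# small_dt is a hypothetical name used only for this illustration.
small_dt <- data.table(
  Operator  = "A",
  ROI_Score = rep(c("Crap", "Good", "OK", "Poor"), times = 2),
  Date      = rep(c("Jan 2013", "Feb 2013"), each = 4),
  N         = c(0, 15, 3, 5,   # Jan 2013, total 23
                0,  2, 2, 0)   # Feb 2013, total 4
)

# := only works on a data.table, so check the class first.
stopifnot(is.data.table(small_dt))

# Per-group share: each N divided by the total of its Operator-Date group.
small_dt[, N.norm := N / sum(N), by = list(Operator, Date)]

# Share of the whole table: no by, so sum(N) runs over every row.
small_dt[, N.share := N / sum(N)]

print(small_dt)
```

Within each Operator-Date group the N.norm values add up to 1, while N.share adds up to 1 over the whole table. A group whose counts are all zero would give NaN, since 0/0 is undefined.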

Answer 1 (score: 1)

I don't think you can do what you want with this data set. Here is why:

install.packages("plyr")
library("plyr")
str(tmp) # this is your data
count(tmp, vars = c("Operator", "ROI_Score"))

which gives:

   Operator ROI_Score freq
1         A      Crap    5
2         A      Good    5
3         A        OK    5
4         A      Poor    5
5         D      Crap    5
6         D      Good    5
7         D        OK    5
8         D      Poor    5
9         J      Crap    5
10        J      Good    5
11        J        OK    5
12        J      Poor    5
13        L      Crap    5
14        L      Good    5
15        L        OK    5
16        L      Poor    5
17        M      Crap    5
18        M      Good    5
19        M        OK    5
20        M      Poor    5

Including Date makes every combination unique, so every frequency would be 1.

With a data.frame you could, in principle, get there with:

ans <- aggregate(N ~ Operator + ROI_Score + Date, data = tmp, FUN = sum)

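and then change the function to do exactly what you want (divide by 100, the number of entries?). But I am not sure that is what you want.

Edit

Since you want the percentage of each rating category by operator and date, I would first subset, then aggregate, then normalize: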

tmp2 <- subset(tmp, Operator == "A")
ans2 <- aggregate(N ~ ROI_Score, data = tmp2, FUN = sum)
ans2$N.norm <- ans2$N/sum(ans2$N)

which gives:

  ROI_Score  N    N.norm
1      Crap  0 0.0000000
2      Good 24 0.5106383
3        OK  9 0.1914894
4      Poor 14 0.2978723
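
The subset above covers one operator across all dates. As a hedged base R sketch of doing the same normalization for every Operator-Date group at once, ave() can compute each group's total without any extra package. The name small_df is hypothetical; the rows are copied from the posted data (operator A, Jan 2013 and Feb 2013).

```r
# Eight rows taken from the posted data: operator A, two months.
# small_df is a hypothetical name used only for this illustration.
small_df <- data.frame(
  Operator  = "A",
  ROI_Score = rep(c("Crap", "Good", "OK", "Poor"), times = 2),
  Date      = rep(c("Jan 2013", "Feb 2013"), each = 4),
  N         = c(0, 15, 3, 5,   # Jan 2013, total 23
                0,  2, 2, 0),  # Feb 2013, total 4
  stringsAsFactors = FALSE
)

# ave() returns a vector as long as N, holding each row's group total.
grp_total <- ave(small_df$N, small_df$Operator, small_df$Date, FUN = sum)

# Divide by the group total; groups that sum to zero get 0 instead of NaN.
small_df$N.norm <- ifelse(grp_total > 0, small_df$N / grp_total, 0)

print(small_df)
```

Returning 0 for an all-zero group is a design choice made for this sketch; leaving NaN in place is equally defensible if empty groups should stand out.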