Question

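I'm stuck on a basic problem of summarizing categorical data. My raw data consists of multiple records of the form UserId, ItemId, CategoryID. For each ItemID there is one fixed CategoryID. For each UserID there is one fixed GroupID. Each UserId can have an arbitrary number of entries, but only one entry per ItemID. When I read the data in from a .csv, I set each column to be a factor.

Here is a toy dataset: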

uIDs <- c("1", "1", "3", "8", "3", "8", "6")
iIDs <- c("a", "c", "d", "d", "e", "f", "g")
cIDs <- c("V", "V", "A", "A", "A", "A", "M")
gIDs <- c("U", "U", "N", "U", "N", "U", "P")
foo <- data.frame(uID = uIDs, iID = iIDs, cID = cIDs, gID = gIDs)

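From this dataset I need to extract various summaries in a usable form, for example:

For each uID, how many iIDs are there?
For each uID, how many cIDs are there?
For each iID, how many uIDs are there?
For each cID, how many uIDs are there?
For each cID, how many gIDs are there?
For each gID, how many cIDs are there?

Very simple stuff, but I've been struggling with it for most of a day. I'm particularly confused by the different ways output is returned by the various functions that could help here (aggregate, summary, by, table and friends). Take summary as an example. Its output looks useful, but I can't figure out how to get at it.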

> summary(foo)
 uID    iID   cID   gID  
  8:1   a:1   A:4   N:2  
 1 :2   c:1   M:1   P:1  
 3 :2   d:2   V:2   U:4  
 6 :1   e:1              
 8 :1   f:1              
        g:1

When I ask what the result actually is, it turns out to be quite complicated, and I don't know how to pick it apart to get what I want.

    > str(summary(foo))
 'table' chr [1:6, 1:4] " 8:1  " "1 :2  " "3 :2  " "6 :1  " ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:6] "" "" "" "" ...
  ..$ : chr [1:4] "uID" "iID" "cID" "gID"

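Given that my needs are simple, what is the simplest way to ask these questions so that I can easily work with the results?

Thanks!

P.S. Sorry if the pasted code isn't formatted correctly - I tried pasting from RStudio but it doesn't look right - suggestions welcome (I searched for advice and found nothing, but I know it's out there somewhere because I read it about 6 months ago...)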

Answer 1

You can use aggregate. I think this is what you're looking for:

> # for each uID, how many iIDs are there? and 
> # for each uID, how many cIDs are there?
> aggregate(cbind(iIDs, cIDs) ~ uID, length, data=foo)
  uID iIDs cIDs
1   1    2    2
2   3    2    2
3   6    1    1
4   8    1    1 # due to the error in the toy example there are two 8
5   8    1    1 # one for "8" and one for " 8" ;)
> 
> # or individually:
> # aggregate(iIDs ~ uID, length, data=foo) 
> # aggregate(cIDs ~ uID, length, data=foo)
>  
> #-------------------------------------------------------------
> # for each iIDs, how many uIDs are there?
> aggregate(uIDs ~ iID, length, data=foo)
  iID uIDs
1   a    1
2   c    1
3   d    2
4   e    1
5   f    1
6   g    1
> #-------------------------------------------------------------
> 
> # for each cID, how many uIDs are there? and
> # for each cID, how many gIDs are there?
> aggregate(cbind(uIDs, gIDs) ~ cID, length, data=foo)
  cID uIDs gIDs
1   A    4    4
2   M    1    1
3   V    2    2
> 
> #-------------------------------------------------------------
> # for each gID, how many cIDs are there?
> aggregate(cIDs ~ gID, length, data=foo)
  gID cIDs
1   N    2
2   P    1
3   U    4
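
One caveat, offered as a sketch rather than as part of the answer above: `length` counts the rows in each group, so a value that repeats is counted every time. If the question is really "how many different gIDs occur for each cID", a distinct count may be closer to what is wanted. The helper name `n_distinct_values` is made up for illustration, and `stringsAsFactors = TRUE` is an assumption meant to mirror the factor columns described in the question.

```r
# Toy data as posted in the question; stringsAsFactors = TRUE is an
# assumption that mirrors "I set each column to be a factor".
foo <- data.frame(
  uID = c("1", "1", "3", "8", "3", "8", "6"),
  iID = c("a", "c", "d", "d", "e", "f", "g"),
  cID = c("V", "V", "A", "A", "A", "A", "M"),
  gID = c("U", "U", "N", "U", "N", "U", "P"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: number of distinct values in a vector
n_distinct_values <- function(x) length(unique(x))

# Rows per cID (what `length` reports)
rows_per_cat <- aggregate(gID ~ cID, data = foo, FUN = length)

# Distinct gIDs per cID
groups_per_cat <- aggregate(gID ~ cID, data = foo, FUN = n_distinct_values)

print(rows_per_cat)
print(groups_per_cat)
```

With this data, cID "A" has 4 rows but only 2 distinct gIDs ("N" and "U"), so the two results differ. Both are ordinary data frames, so a single value can be pulled out with something like `groups_per_cat$gID[groups_per_cat$cID == "A"]`.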

Answer 2

You can answer most of your questions as follows:

For each uID, how many iIDs are there?

with(foo, rowSums(table(uID, iID)))

1 3 6 8 
2 2 1 2
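
The same `table` idea seems to extend to the other questions, and the result can be indexed by name or turned into a data frame. The following is a minimal sketch under a few assumptions: the toy data is rebuilt as posted, `stringsAsFactors = TRUE` is added to mirror the factor columns, and the variable names and the `n_items` column name are made up for illustration.

```r
# Toy data as posted in the question (stringsAsFactors = TRUE is an assumption)
foo <- data.frame(
  uID = c("1", "1", "3", "8", "3", "8", "6"),
  iID = c("a", "c", "d", "d", "e", "f", "g"),
  cID = c("V", "V", "A", "A", "A", "A", "M"),
  gID = c("U", "U", "N", "U", "N", "U", "P"),
  stringsAsFactors = TRUE
)

# Rows per uID as a named table; entries can be looked up by level name
items_per_user <- table(uID = foo$uID)
items_for_user_1 <- items_per_user["1"]

# The same counts as an ordinary data frame with columns uID and n_items
items_df <- as.data.frame(items_per_user, responseName = "n_items")

# Cross-tabulate cID against gID; a cell > 0 means that pair occurs at least once
cat_by_group <- table(foo$cID, foo$gID) > 0

# Distinct gIDs per cID (sum across each row)
groups_per_cat <- rowSums(cat_by_group)

# Distinct cIDs per gID (sum down each column)
cats_per_group <- colSums(cat_by_group)

print(items_df)
print(groups_per_cat)
print(cats_per_group)
```

The `> 0` step is what turns row counts into distinct counts. Without it, `rowSums` gives the number of rows per cID, which is what `length` reports inside `aggregate`.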

NB: I think there is a slight error in your example data. One of your uIDs is " 8" (with a leading space) rather than "8", which confused me.
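
If the stray space also shows up in the real .csv, one possible way to deal with it is to trim whitespace before building the factor. This is only a sketch: which of the two entries carries the leading space is an assumption here, and `uIDs_raw` and `uID_clean` are illustrative names.

```r
# Assumption: the fourth entry is the one with the leading space
uIDs_raw <- c("1", "1", "3", " 8", "3", "8", "6")

# Untrimmed, " 8" and "8" become two separate factor levels
levels_before <- levels(factor(uIDs_raw))

# trimws() (base R) strips leading and trailing whitespace
uID_clean <- factor(trimws(uIDs_raw))
levels_after <- levels(uID_clean)

print(length(levels_before))
print(levels_after)
print(table(uID_clean))
```

After trimming, both entries fall under the single level "8", so counts per uID are no longer split across two rows.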

Syntax for referencing the results of a summary frequency count of categorical variables/factors
