Question

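I'm fairly new to R, so forgive me if this seems like a silly question. I've run out of ideas from other examples on how to make this work, and I'm hoping someone can point me in the right direction.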

I'm trying to get a distinct count of SITE_ID for each CLNCL_TRIAL_ID.

My data is actually in a data frame (data2), but it looks something like this:

CLNCL_TRIAL_ID:
89794,
89794,
8613,
8613

SITE_ID:
12456,
12456,
100341,
30807

My end result would be a count of 89794 = 1 and 8613 = 2.

Here is what I have so far:

z <- aggregate(data2$SITE_ID ~ data2$CLNCL_TRIAL_ID, data2, function(SITE_ID) length(unique(data2$SITE_ID)))

I have also tried a few alternative forms:

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, sum(!duplicated(data$SITE_ID)))

  aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, nlevels(factor(data2$SITE_ID)))

  aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(SITE_ID) length(unique(data2$SITE_ID)))

The problem I keep running into is that instead of grouping by CLNCL_TRIAL_ID, it counts across the whole table, so I get 89794 = 3 and 8613 = 3.

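Does anyone know how to correct this? I feel like I'm overlooking something silly. As a side note, I'm trying to stick to base R if possible. If that isn't possible, it's no big deal.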

Answer 1

There are two ways to do this:

Data:

df <- data.frame(CLNCL_TRIAL_ID = c(89794, 89794,8613, 8613), SITE_ID = c(12456, 12456, 100341, 30807))

Base R, with table:

table(df)
               SITE_ID
CLNCL_TRIAL_ID 12456 30807 100341
     8613      0     1      1
     89794     2     0      0

dplyr:

library(dplyr)
df %>% 
  group_by(CLNCL_TRIAL_ID, SITE_ID) %>%
  summarise(count = n())

  CLNCL_TRIAL_ID SITE_ID count
1           8613   30807     1
2           8613  100341     1
3          89794   12456     2

Update

To count distinct values, just use unique in base R or distinct in dplyr:

table(unique(df))
## to group/summarise the results you can use rowSums()
rowSums(table(unique(df)))


df %>%
distinct %>%
group_by(CLNCL_TRIAL_ID) %>%
summarise(count = n())

Or, more concisely, following Marek's suggestion:

df %>% distinct %>% count(CLNCL_TRIAL_ID)
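To make the base R step concrete, here is a small sketch using the sample data from the question. The variable names distinct_pairs and counts are my own, chosen only for illustration.

```r
# Sample data from the question
df <- data.frame(
  CLNCL_TRIAL_ID = c(89794, 89794, 8613, 8613),
  SITE_ID        = c(12456, 12456, 100341, 30807)
)

# Drop duplicated (trial, site) pairs first: 4 rows become 3
distinct_pairs <- unique(df)

# Cross-tabulate the remaining pairs, then sum each row
# to get the number of distinct sites per trial
counts <- rowSums(table(distinct_pairs))

print(counts)  # 8613 -> 2, 89794 -> 1
```

The result is a named vector whose names are the trial IDs as character strings, so counts[["8613"]] looks up a single trial.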

Answer 2

Using functions from the dplyr package:

require(dplyr)
data2 %>%
     group_by(CLNCL_TRIAL_ID) %>%
     summarise(nd = n_distinct(SITE_ID))

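Your original approach doesn't work because you refer to the original dataset inside the function, so every group gets the count for the whole column. The function should work on its own argument, which holds only the values for the current group. Each of the following works: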

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) length(unique(x)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) sum(!duplicated(x)))
aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, function(x) nlevels(factor(x)))

Also, if you want to mix base R and dplyr:

aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2, n_distinct)
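To see the difference side by side, here is a minimal sketch that rebuilds data2 from the sample values in the question (the real data2 is assumed to have the same two columns). The names broken and fixed are mine.

```r
# Sample data taken from the question
data2 <- data.frame(
  CLNCL_TRIAL_ID = c(89794, 89794, 8613, 8613),
  SITE_ID        = c(12456, 12456, 100341, 30807)
)

# Broken: the function ignores its argument and counts the whole column,
# so every group gets the same answer
broken <- aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2,
                    function(x) length(unique(data2$SITE_ID)))

# Fixed: x holds only the SITE_ID values of the current group
fixed <- aggregate(SITE_ID ~ CLNCL_TRIAL_ID, data2,
                   function(x) length(unique(x)))

print(broken)  # SITE_ID is 3 for both trials
print(fixed)   # SITE_ID is 2 for 8613 and 1 for 89794
```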

Answer 3

A solution with the data.table package:

require(data.table)
df <- data.table(CLNCL_TRIAL_ID = c(89794, 89794,8613, 8613), 
    SITE_ID = c(12456, 12456, 100341, 30807))
df[,length(unique(SITE_ID)),by=CLNCL_TRIAL_ID]

which produces

   CLNCL_TRIAL_ID V1
1:          89794  1
2:           8613  2

Answer 4

Another approach is to use the fun.aggregate argument of the dcast function (found in the reshape2 package).

But if you don't want a cross table, you can still use dcast.
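A minimal sketch of this approach, assuming reshape2 is installed. The two formulas and the anonymous function are my own illustration of fun.aggregate and not necessarily the code this answer originally used.

```r
# Sample data from the question
df <- data.frame(
  CLNCL_TRIAL_ID = c(89794, 89794, 8613, 8613),
  SITE_ID        = c(12456, 12456, 100341, 30807)
)

# Helper name is an assumption, used only for this sketch
count_distinct <- function(x) length(unique(x))

if (requireNamespace("reshape2", quietly = TRUE)) {
  # Cross table: one column per SITE_ID, each cell is 0 or 1
  wide <- reshape2::dcast(df, CLNCL_TRIAL_ID ~ SITE_ID,
                          value.var = "SITE_ID",
                          fun.aggregate = count_distinct)

  # No cross table: "." collapses all sites into a single count column
  long <- reshape2::dcast(df, CLNCL_TRIAL_ID ~ .,
                          value.var = "SITE_ID",
                          fun.aggregate = count_distinct)

  print(wide)
  print(long)
}
```

The second call gives one row per trial, which is the shape the question asks for.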

R aggregate error: count distinct
