Question

Suppose I have the following data frame:

userID <- c(1, 1, 3, 5, 3, 5)
A      <- c(2, 3, 2, 1, 2, 1)
B      <- c(2, 3, 1, 0, 1, 0)
df     <- data.frame(userID, A, B)
df
#   userID A B
# 1      1 2 2
# 2      1 3 3
# 3      3 2 1
# 4      5 1 0
# 5      3 2 1
# 6      5 1 0

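I want to create a data frame with the same columns plus a final column that counts how many times each unique tuple/combination of the other columns occurs. The output should look like this: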

userID A B count
     1 2 2     1
     1 3 3     1
     3 2 1     2 
     5 1 0     2

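That is, the tuple/combination (1, 2, 2) occurs once, so count = 1, while the tuple (3, 2, 1) occurs twice, so count = 2. I would prefer not to use any external packages.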

Answer 1

1) aggregate

ag <- aggregate(count ~ ., cbind(count = 1, df), length)
ag[do.call("order", ag), ]  # sort the rows

giving:

  userID A B count
3      1 2 2     1
4      1 3 3     1
2      3 2 1     2
1      5 1 0     2

If the order of the rows is unimportant, the last line of code, which sorts the rows, can be omitted.
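To make the mechanics of the aggregate call more visible, here is a minimal sketch that splits it into steps, using the question's data. The names `tmp` and `ag_sum` are illustrative only and are not part of the answer.

```r
df <- data.frame(userID = c(1, 1, 3, 5, 3, 5),
                 A      = c(2, 3, 2, 1, 2, 1),
                 B      = c(2, 3, 1, 0, 1, 0))

# Step 1: prepend a helper column of ones so there is something to aggregate.
tmp <- cbind(count = 1, df)

# Step 2: the formula `count ~ .` groups by every other column;
# length() then counts the rows in each group.
ag <- aggregate(count ~ ., tmp, length)

# Summing the ones yields the same counts as taking the group length.
ag_sum <- aggregate(count ~ ., tmp, sum)
```

Either aggregating function works here because the helper column contains only ones.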

The remaining solutions use the indicated packages:

2) sqldf

library(sqldf)
Names <- toString(names(df))
fn$sqldf("select *, count(*) count from df group by $Names order by $Names")

giving:

  userID A B count
1      1 2 2     1
2      1 3 3     1
3      3 2 1     2
4      5 1 0     2

If the order is unimportant, the order by clause can be omitted.

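3) dplyr

library(dplyr)
# regroup() existed only in early dplyr releases; with dplyr >= 1.0.0,
# group by every column via across(everything()) instead.
df %>% group_by(across(everything())) %>% summarise(count = n())

giving (shown as printed by the dplyr release this was written against; newer releases print a tibble header instead):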

Source: local data frame [4 x 4]
Groups: userID, A
  userID A B count
1      1 2 2     1
2      1 3 3     1
3      3 2 1     2
4      5 1 0     2

4) data.table

library(data.table)
data.table(df)[, list(count = .N), by = names(df)]

giving:

   userID A B count
1:      1 2 2     1
2:      1 3 3     1
3:      3 2 1     2
4:      5 1 0     2

ADDED additional solutions. Also made some minor improvements.

Answer 2

Here is a fairly straightforward way (ave to the rescue!):

unique(cbind(df, 
             count = ave(rep(1, nrow(df)),
                         do.call(paste, df), 
                         FUN = length)))
#   userID A B count
# 1      1 2 2     1
# 2      1 3 3     1
# 3      3 2 1     2
# 4      5 1 0     2

Here is a variation on the above:

unique(within(df, {
  counter <- rep(1, nrow(df))
  count <- ave(counter, df, FUN = length)
  rm(counter)
}))
#   userID A B count
# 1      1 2 2     1
# 2      1 3 3     1
# 3      3 2 1     2
# 4      5 1 0     2
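Both variants rely on ave() returning one value per input row rather than one per group, which is why unique() is applied afterwards. A minimal sketch using the question's data (the names `key` and `per_row` are illustrative only):

```r
df <- data.frame(userID = c(1, 1, 3, 5, 3, 5),
                 A      = c(2, 3, 2, 1, 2, 1),
                 B      = c(2, 3, 1, 0, 1, 0))

# One key string per row, e.g. "1 2 2" for the first row.
key <- do.call(paste, df)

# ave() applies length() within each key group and writes the result
# back to every row of that group, so the output has nrow(df) elements.
per_row <- ave(rep(1, nrow(df)), key, FUN = length)
per_row
# [1] 1 1 2 2 2 2
```

Dropping unique() is therefore useful when the count is wanted next to every original row instead of once per tuple.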

Answer 3

userID <- c(1, 1, 3, 5, 3, 5)
A      <- c(2, 3, 2, 1, 2, 1)
B      <- c(2, 3, 1, 0, 1, 0)
df     <- data.frame(userID, A, B)

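Quickly build a key for each tuple:

# Use a separator so that, e.g., (1, 12, 3) and (11, 2, 3) get different keys.
df$AB <- as.factor(paste(df$userID, df$A, df$B, sep = "_"))

Without external packages, just use summary(), store the result as a data frame, and then merge the counts back into the original data:

# Note: summary() on a factor lumps levels beyond maxsum (default 100)
# into "(Other)", so this suits data with at most 100 distinct tuples.
df2 <- as.data.frame(summary(df$AB))
df2 <- data.frame(x = row.names(df2), y = df2[1])
names(df2) <- c("AB", "count")
df <- merge(df, df2, by = "AB", all.x = TRUE)
df$AB <- NULL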

This is almost the final output, just with duplicates:

df

  userID A B count
1      1 2 2     1
2      1 3 3     1
3      3 2 1     2
4      3 2 1     2
5      5 1 0     2
6      5 1 0     2

Finally, clean up the duplicates:

df <- df[!duplicated(df), ]

There you go:

df

  userID A B count
1      1 2 2     1
2      1 3 3     1
3      3 2 1     2
5      5 1 0     2
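The row names 1, 2, 3, 5 in this output follow from how duplicated() works: it flags every repeat of an earlier row, and subsetting keeps the original row names. A minimal sketch, assuming the merged data frame shown earlier (the names `x`, `dup`, and `dedup` are illustrative only):

```r
# The merged data frame from the previous step, typed in by hand.
x <- data.frame(userID = c(1, 1, 3, 3, 5, 5),
                A      = c(2, 3, 2, 2, 1, 1),
                B      = c(2, 3, 1, 1, 0, 0),
                count  = c(1, 1, 2, 2, 2, 2))

# TRUE marks a row that repeats an earlier row; first occurrences are FALSE.
dup <- duplicated(x)
dup
# [1] FALSE FALSE FALSE  TRUE FALSE  TRUE

# Keeping the non-duplicates preserves the original row names 1, 2, 3, 5.
dedup <- x[!dup, ]
kept_names <- rownames(dedup)

# Optional: renumber the rows 1..4.
rownames(dedup) <- NULL
```

Resetting the row names is purely cosmetic; the data are the same either way.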

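It has been a while since I did this with sql or plyr. If you are able to use packages later on, dplyr can do it. And if things start getting more complex, Bioconductor has a lot of good sequencing packages.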

Hope this helps.

Answer 4

This should do the trick, even if it is a bit ugly:

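# Use a separator so that multi-digit values are split back correctly.
vec <- table(apply(df, 1, paste, collapse = "_"))

# The recovered columns are character strings, not numbers.
df2 <- data.frame(do.call(rbind, strsplit(names(vec), "_", fixed = TRUE)))

names(df2) <- names(df)
df2$count <- as.vector(vec)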

#  userID A B count
#1      1 2 2     1
#2      1 3 3     1
#3      3 2 1     2
#4      5 1 0     2
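Approaches that paste the columns into a single key string depend on the separator once values have more than one digit. A minimal sketch with hypothetical data (not from the question) showing how two different tuples can collide:

```r
# Hypothetical data: two distinct tuples, (1, 12, 3) and (11, 2, 3).
d <- data.frame(userID = c(1, 11), A = c(12, 2), B = c(3, 3))

# With an empty separator both rows collapse to the same key.
key_nosep <- do.call(paste, c(d, sep = ""))
key_nosep
# [1] "1123" "1123"

# With a separator the keys stay distinct and can be split back apart.
key_sep <- do.call(paste, c(d, sep = "_"))
key_sep
# [1] "1_12_3" "11_2_3"
```

Any separator that cannot occur inside the values themselves would serve the same purpose.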

Add a column counting the unique tuples in a data frame
