Question

I have a data frame (DF) with 3 columns: ID number, category, and brand:

ID             Category        Brand
00129          Bits            B89
00129          Bits            B87
00129          Bits            B87
00129          Logs            B32
00129          Logs            B27
00129          Logs            B27
00130          Bits            B12
00130          Bits            B14
00130          Bits            B14
00131          Logs            B32
00131          Logs            B27
00131          Logs            B32
00132          Bits            B77
00132          Bits            B89
00132          Bits            B89

I have 200 distinct categories and 2000 distinct brands.

I want to find, for each category, the two brands most commonly purchased by ID number:

Category       Brand
Bits           B89,B87
Logs           B32,B27

Or:

#$Bits
    #[1] "B89" "B87"

#$Logs
    #[1] "B32" "B27"

The only approach I can think of is to reshape the data frame like this, so that the counts are computed per distinct ID number:

     B89   B87   B32   B27   B12   B14  
1    1     2     1     2     0     0
2    0     0     0     0     1     2
3    0     0     2     1     0     0
4    2     1     0     0     0     0
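
A minimal sketch of one way to build a count table of this shape, assuming base R's table() is acceptable. The name `counts` and the re-typed sample data are only for illustration; IDs are stored as strings so the leading zeros survive.

```r
# Sample data copied from the question.
df <- data.frame(
  ID = rep(c("00129", "00130", "00131", "00132"), times = c(6, 3, 3, 3)),
  Category = c(rep(c("Bits", "Logs"), each = 3),
               rep("Bits", 3), rep("Logs", 3), rep("Bits", 3)),
  Brand = c("B89", "B87", "B87", "B32", "B27", "B27",
            "B12", "B14", "B14",
            "B32", "B27", "B32",
            "B77", "B89", "B89"),
  stringsAsFactors = FALSE
)

# Rows are IDs, columns are brands, cells are purchase counts.
counts <- table(df$ID, df$Brand)
print(counts)
```

Note that this table no longer carries the category, which is the limitation described further down.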

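Then, when a column is populated with a value greater than 0, return the other columns that are also populated with values greater than 0.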

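# For each brand column i, list the other brands that have a count > 0
# in the rows (IDs) where brand i has a count > 0.
list1 = setNames(object = lapply(1:NCOL(df), function(i)
  unique(colnames(df)[-i][which(as.matrix(df[which(df[,i] > 0), -i, drop = FALSE]) > 0,
                                   arr.ind = TRUE)[,2]])),
  nm = colnames(df))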

But then I lose the categories, which I need. Any ideas on how to solve this?

Answer 1

This might work. I ended up with a combination of data.table and dplyr, since I'm not that familiar with data.table yet.

library(data.table)
library(dplyr)

# Load the packages first: data.table() is used in the next statement.
# colClasses="character" keeps the leading zeros in ID.
dt = data.table(read.table(text="ID             category              brand
00129          Bits            B89
00129          Bits            B87
00129          Bits            B87
00129          Logs            B32
00129          Logs            B27
00129          Logs            B27
00130          Bits            B12
00130          Bits            B14
00130          Bits            B14
00131          Logs            B32
00131          Logs            B27
00131          Logs            B32
00132          Bits            B77
00132          Bits            B89
00132          Bits            B89",header=T,colClasses="character"))

# get all combinations of two purchases.
dt = dt[,.(list(unique(brand))),.(ID,category)][, .(combn(unlist(V1), 2,simplify=FALSE)),.(ID,category)]

# concatenate those two purchases to a string
dt$V1 = unlist(lapply(dt$V1,function(x) {paste(x,collapse=", ")}))

# fetch only the top per category
dt %>% group_by(V1,category) %>% summarize(n=n()) %>% group_by(category) %>% top_n(n = 1) %>% select(-n)

Output:

        V1 category
1 B12, B14     Bits
2 B32, B27     Logs
3 B77, B89     Bits
4 B89, B87     Bits

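I believe this is correct given your dataset, although it does not match your expected output? Every pair in Bits is bought by exactly one ID, so the three Bits pairs tie for first place.

Optionally, add

dt = dt %>% group_by(ID,category) %>% mutate(unique_types = n_distinct(brand)) %>% filter(unique_types>1) %>% ungroup() %>% as.data.table()

before the combination step if some purchases involve only one brand, because combn(x, m) fails when length(x) < m. The as.data.table() call converts the dplyr result back so that the data.table syntax in the next step still works.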
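
For comparison, here is a hedged base R sketch of the same pair-counting idea. It is not the code above, just an alternative under two assumptions: each brand pair is sorted before counting, so "B27, B32" and "B32, B27" are treated as the same pair, and ties for first place are all kept. The names `u` and `top_pairs` are illustrative only.

```r
# Sample data copied from the question (column names as in the question).
df <- data.frame(
  ID = rep(c("00129", "00130", "00131", "00132"), times = c(6, 3, 3, 3)),
  Category = c(rep(c("Bits", "Logs"), each = 3),
               rep("Bits", 3), rep("Logs", 3), rep("Bits", 3)),
  Brand = c("B89", "B87", "B87", "B32", "B27", "B27",
            "B12", "B14", "B14",
            "B32", "B27", "B32",
            "B77", "B89", "B89"),
  stringsAsFactors = FALSE
)

# Keep one row per (ID, Category, Brand) so repeat purchases count once.
u <- unique(df[, c("ID", "Category", "Brand")])

top_pairs <- lapply(split(u, u$Category), function(cat_rows) {
  # One character vector of brands per ID within this category.
  per_id <- split(cat_rows$Brand, cat_rows$ID)
  pairs <- unlist(lapply(per_id, function(b) {
    if (length(b) < 2) return(character(0))  # combn() needs at least 2 items
    # Sort first so the pair string does not depend on purchase order.
    combn(sort(b), 2, FUN = paste, collapse = ", ")
  }))
  if (length(pairs) == 0) return(character(0))
  tab <- table(pairs)
  names(tab)[tab == max(tab)]  # the most frequent pair(s), ties included
})

print(top_pairs)
```

On the sample data this gives a single pair for Logs ("B27, B32", bought by two IDs) and a three-way tie for Bits, which mirrors the output shown above apart from the sorted order inside each pair.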
