Find the most frequent value in each row in R

Time: 2016-09-02 19:52:39

Tags: python r pyspark rpy2 word-frequency

I have a table like the one below:

   Name     Mon    Tue     Wed    Thu     Fri    Sat    Sun

1 John     Apple  Orange  Apple  Banana  Apple  Apple  Orange
2 Ricky    Banana Apple   Banana Banana  Banana Banana Apple
3 Alex     Apple  Orange  Orange Apple   Apple  Orange Orange
4 Robbin   Apple  Apple   Apple  Apple   Apple  Banana Banana
5 Sunny    Banana Banana  Apple  Apple   Apple  Banana Banana

I want to find the most frequent fruit for each person and add the fruit name and its count as new columns.

For example:

   Name     Mon    Tue     Wed    Thu     Fri    Sat    Sun      Max_Acc  Count

1 John     Apple  Orange  Apple  Banana  Apple  Apple  Orange     Apple       4
2 Ricky    Banana Apple   Banana Banana  Banana Banana Apple      Banana      5
3 Alex     Apple  Orange  Orange Apple   Apple  Orange Orange     Orange      4
4 Robbin   Apple  Apple   Apple  Apple   Apple  Banana Banana     Apple       5
5 Sunny    Banana Banana  Apple  Apple   Apple  Banana Banana     Banana      4

I am having trouble doing this by row. I can use the table() function to get the frequencies within a column:

> table(df$Mon)

 Apple Banana
     3      2

But here I want the name of the most frequent fruit per row, in a new column.

2 Answers:

Answer 0: (score: 2)

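If we need the "Count" and the "Names" corresponding to the max "Count", we can loop over the rows of the dataset (apply with MARGIN = 1), use table to get the frequencies, extract the maximum value and the name that corresponds to it, rbind the per-row results, and cbind them with the original dataset: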

# For each row (MARGIN = 1), tabulate the fruits and keep the top count and its name
cbind(df1, do.call(rbind, apply(df1[-1], 1, function(x) {
  x1 <- table(x)
  data.frame(Count = max(x1), Names = names(x1)[which.max(x1)])
})))

#    Name    Mon    Tue    Wed    Thu    Fri    Sat    Sun Count  Names
#1   John  Apple Orange  Apple Banana  Apple  Apple Orange     4  Apple
#2  Ricky Banana  Apple Banana Banana Banana Banana  Apple     5 Banana
#3   Alex  Apple Orange Orange  Apple  Apple Orange Orange     4 Orange
#4 Robbin  Apple  Apple  Apple  Apple  Apple Banana Banana     5  Apple
#5  Sunny Banana Banana  Apple  Apple  Apple Banana Banana     4 Banana
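One detail worth knowing about the approach above: which.max returns the index of the first maximum, and table orders its entries by sorted level name, so a tie between fruits resolves to the name that sorts first. The example data has no ties, so the vector below is a made-up illustration and not part of the question. A minimal sketch:

```r
# Hypothetical row with a tie: two Apples and two Oranges
x <- c("Orange", "Apple", "Orange", "Apple")
tbl <- table(x)

# which.max picks the first maximum; table sorts names, so "Apple" wins the tie
top_one <- names(tbl)[which.max(tbl)]

# To keep every tied name instead, compare each count against the maximum
top_all <- names(tbl)[tbl == max(tbl)]

print(top_one)
print(top_all)
```

If ties matter for your data, the second form (possibly collapsed with paste) may be the safer choice.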

Alternatively, we can use data.table:

library(data.table)
setDT(df1)[, c("Names", "Count") := {tbl <- table(unlist(.SD))
                    .(names(tbl)[which.max(tbl)], max(tbl))}, by = Name]

Answer 1: (score: 1)

Another approach is to loop over all unique fruits, as follows:

fruits_unique <- unique(unlist(dat[-1]))
occurence <- sapply(fruits_unique, function(x) rowSums(dat[,-1] == x)) 
# Using this data to create the resulting columns
ind <- apply(occurence,1,which.max)
dat$Names <- fruits_unique[ind]
dat$Count <- occurence[cbind(seq_along(ind), ind)]

Result:

    Name    Mon    Tue    Wed    Thu    Fri    Sat    Sun  Names Count
1   John  Apple Orange  Apple Banana  Apple  Apple Orange  Apple     4
2  Ricky Banana  Apple Banana Banana Banana Banana  Apple Banana     5
3   Alex  Apple Orange Orange  Apple  Apple Orange Orange Orange     4
4 Robbin  Apple  Apple  Apple  Apple  Apple Banana Banana  Apple     5
5  Sunny Banana Banana  Apple  Apple  Apple Banana Banana Banana     4
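To make the matrix-indexing step easier to follow, here is a self-contained sketch of the same idea. The data.frame call is a reconstruction of the question's table (the asker did not post the original object), and stringsAsFactors = FALSE is an assumption so that the columns are plain character vectors.

```r
# Reconstruction of the table from the question (assumed, not the asker's object)
dat <- data.frame(
  Name = c("John", "Ricky", "Alex", "Robbin", "Sunny"),
  Mon  = c("Apple", "Banana", "Apple", "Apple", "Banana"),
  Tue  = c("Orange", "Apple", "Orange", "Apple", "Banana"),
  Wed  = c("Apple", "Banana", "Orange", "Apple", "Apple"),
  Thu  = c("Banana", "Banana", "Apple", "Apple", "Apple"),
  Fri  = c("Apple", "Banana", "Apple", "Apple", "Apple"),
  Sat  = c("Apple", "Banana", "Orange", "Banana", "Banana"),
  Sun  = c("Orange", "Apple", "Orange", "Banana", "Banana"),
  stringsAsFactors = FALSE
)

# One column per distinct fruit, one row per person, cells hold the counts
fruits_unique <- unique(unlist(dat[-1]))
occurence <- sapply(fruits_unique, function(x) rowSums(dat[, -1] == x))

# Column index of the largest count in each row (first one on ties)
ind <- apply(occurence, 1, which.max)

# A two-column index matrix (row, column) pulls one cell per row
counts <- occurence[cbind(seq_along(ind), ind)]

dat$Names <- fruits_unique[ind]
dat$Count <- counts
print(dat)
```

Indexing a matrix with a two-column matrix selects individual cells rather than whole rows or columns, which is why each person ends up with exactly one count.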