In R

Time: 2018-04-25 05:06:49

Tags: r data.table tidyverse

I used table() to create a table of categorical tags from my raw data.

I want to add up the frequencies/counts of related tags. For example:

S/n No.   Tags                                 Frequency/Count

1         Problem 1                            56325
2         Problem 2                            11233
3         Problem 3                            546321
4         Problem 1 & Problem 2                2123345
5         Problem 2 & Problem 3                9657531
6         Problem 1 & Problem 2 & Problem 3    623589542

I would like the output to look like this:

S/n no.  Tagging     Freq/Count

1        Problem 1  (56325+2123345+623589542)=625769212
2        Problem 2  (11233+2123345+9657531+623589542)=635381651
3        Problem 3  (546321+9657531+623589542)=633793394

Note: the sums shown inside the parentheses should not appear in the output.

I have 78 distinct tag keywords, and the table has about 250 rows.

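@Maurits Evers and @akrun answered this question correctly. You need the tidyverse package for their solutions. If you do not have it installed, enter

install.packages("tidyverse")

in the R console. Visit the tidyverse website for details.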

4 answers:

Answer 0: (score: 4)

A data.table solution using tstrsplit:

library(data.table)
setDT(df)[,.(Tags = unlist(tstrsplit(Tags, " & ", fixed = TRUE)), # Split by &
             Freq = Frequency_Count) # Take the Frequency_Count too
          ][!is.na(Tags), # ignore non-matches
            .(Freq_Count = sum(Freq)), # sum frequencies
          by = Tags] # by the split tags
#         Tags Freq_Count
# 1: Problem 1  625769212
# 2: Problem 2  635381651
# 3: Problem 3  633793394
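
The same split-then-sum idea can also be written as two explicit steps, which may be easier to follow. This is only a sketch under stated assumptions: the table `dt` is rebuilt here from the question's sample values, the column names `SN_No`, `Tags` and `Frequency_Count` follow the other answers, and `Tags` is stored as character rather than factor.

```r
library(data.table)

# Sample data rebuilt from the question (column names are assumptions).
dt <- data.table(
  SN_No = 1:6,
  Tags = c("Problem 1", "Problem 2", "Problem 3",
           "Problem 1 & Problem 2",
           "Problem 2 & Problem 3",
           "Problem 1 & Problem 2 & Problem 3"),
  Frequency_Count = c(56325, 11233, 546321, 2123345, 9657531, 623589542)
)

# Step 1: split each row's tag string; grouping by the row's own columns
# repeats SN_No and Frequency_Count once per individual tag.
long <- dt[, .(Tags = unlist(strsplit(Tags, " & ", fixed = TRUE))),
           by = .(SN_No, Frequency_Count)]

# Step 2: sum the repeated counts per individual tag.
res <- long[, .(Freq_Count = sum(Frequency_Count)), by = Tags]
print(res)
```

Keeping the intermediate `long` table makes it possible to inspect which original rows contribute to each tag before summing.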

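Answer 1: (score: 3)

Note that in your example the entries in column S/n no. appear to be ignored when the counts are summed; since you gave no details, I will ignore the entries in this column.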

We can split the entries with strsplit, then unnest, group the rows by Tags, and summarise Frequency_Count:

library(tidyverse)
df %>%
    mutate_if(is.factor, as.character) %>%      # strsplit needs character input
    select(-SN_No) %>%
    mutate(Tags = strsplit(Tags, " & ")) %>%    # list column of individual tags
    unnest(Tags) %>%                            # one row per individual tag
    group_by(Tags) %>%
    summarise(Freq_Count = sum(Frequency_Count))
## A tibble: 3 x 2
#  Tags      Freq_Count
#  <chr>          <int>
#1 Problem 1  625769212
#2 Problem 2  635381651
#3 Problem 3  633793394

Sample data

df <- read.table(text =
    "'SN_No'     Tags    'Frequency_Count'
1      'Problem 1'      56325
2      'Problem 2'      11233
3      'Problem 3'      546321
4      'Problem 1 & Problem 2'     2123345
5      'Problem 2 & Problem 3'      9657531
6      'Problem 1 & Problem 2 & Problem 3'      623589542", header = TRUE)

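Answer 2: (score: 3)

Assuming you mean that Problem 1 & Problem 2 share the frequency 2123345, I read your data in as follows and used the aggregate function to get what I think you want: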

table1 <- read.table(text = '
  Tags    FrequencyCount
  Problem1      56325
  Problem2      11233
  Problem3      546321
  Problem1      2123345 
  Problem2     2123345
  Problem2     9657531
  Problem3     9657531
  Problem1     623589542
  Problem2     623589542
  Problem3     623589542',
                 header = TRUE) 


aggregate(FrequencyCount ~ Tags, table1, sum)

      Tags FrequencyCount
1 Problem1      625769212
2 Problem2      635381651
3 Problem3      633793394
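
If the data still has the combined form from the question (several tags in one cell), the long table used above can be built in base R before calling aggregate(). The sketch below assumes the tags are separated by " & " and uses hypothetical column names `Tags` and `Frequency_Count`.

```r
# Sample data in the combined form from the question (column names are assumptions).
df <- data.frame(
  Tags = c("Problem 1", "Problem 2", "Problem 3",
           "Problem 1 & Problem 2",
           "Problem 2 & Problem 3",
           "Problem 1 & Problem 2 & Problem 3"),
  Frequency_Count = c(56325, 11233, 546321, 2123345, 9657531, 623589542),
  stringsAsFactors = FALSE
)

# Split each cell into its individual tags (a list with one vector per row).
parts <- strsplit(df$Tags, " & ", fixed = TRUE)

# Repeat each row's count once per tag found in that row.
long <- data.frame(
  Tags = unlist(parts),
  Frequency_Count = rep(df$Frequency_Count, lengths(parts)),
  stringsAsFactors = FALSE
)

# Sum per individual tag.
result <- aggregate(Frequency_Count ~ Tags, data = long, FUN = sum)
print(result)
```

This avoids retyping the expanded rows by hand, which matters with 78 tags and about 250 rows.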

If you need to fill in missing values, as in the example below, you can first run this to carry the previous value forward:

table1 <- read.table(text = '
  Tags    FrequencyCount
  Problem1      56325
  Problem2      11233
  Problem3      546321
  Problem1      2123345 
  Problem2     NA
  Problem2     9657531
  Problem3     NA
  Problem1     623589542
  Problem2     NA
  Problem3     NA',
                 header = TRUE) 

library(data.table)
while(sum(is.na(table1$FrequencyCount)) > 0){
table1$FrequencyCount <- ifelse(is.na(table1$FrequencyCount), 
shift(table1$FrequencyCount), table1$FrequencyCount)
}
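
The loop above repeats until no NA is left, because consecutive NAs need one pass each. As an alternative sketch, the same carry-forward can be done in a single vectorised step in base R. The vector `x` is a hypothetical stand-in for `table1$FrequencyCount`, and the approach assumes the first element is not NA.

```r
# Hypothetical counts with gaps; NA means "same value as the row above".
x <- c(56325, 11233, 546321, 2123345, NA, 9657531, NA, 623589542, NA, NA)

# Position of each non-NA value, 0 where the value is missing.
pos <- ifelse(is.na(x), 0L, seq_along(x))

# Running maximum gives, for every row, the index of the latest non-NA value.
last_seen <- cummax(pos)

# Look up the carried-forward values (assumes x[1] is not NA).
filled <- x[last_seen]
print(filled)
```

The filled vector can then be assigned back to the column before calling aggregate().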

Answer 3: (score: 3)

Here is an option with separate_rows:

library(tidyverse)
df1 %>% 
  separate_rows(Tags, sep = "\\s+&\\s+") %>% 
  group_by(Tags) %>% 
  summarise(SN_No = first(SN_No), Frequency_Count = sum(Frequency_Count)) %>%
  select(names(df1))
# A tibble: 3 x 3
#    SN_No Tags      Frequency_Count
#   <int> <chr>               <int>
#1     1 Problem 1       625769212
#2     2 Problem 2       635381651
#3     3 Problem 3       633793394

Data

df1 <- structure(list(SN_No = 1:6, Tags = structure(c(1L, 4L, 6L, 2L, 
 5L, 3L), .Label = c("Problem 1", "Problem 1 & Problem 2", 
  "Problem 1 & Problem 2 & Problem 3", 
 "Problem 2", "Problem 2 & Problem 3", "Problem 3"), class = "factor"), 
Frequency_Count = c(56325L, 11233L, 546321L, 2123345L, 9657531L, 
623589542L)), .Names = c("SN_No", "Tags", "Frequency_Count"
 ), class = "data.frame", row.names = c(NA, -6L))