I created a table of tag categories from my raw data using table().
I want to add up the frequencies/counts of related tags. For example:
S/n No.  Tags                                Frequency/Count
1        Problem 1                           56325
2        Problem 2                           11233
3        Problem 3                           546321
4        Problem 1 & Problem 2               2123345
5        Problem 2 & Problem 3               9657531
6        Problem 1 & Problem 2 & Problem 3   623589542
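Now I want the output to look like this:
S/n No.  Tags       Frequency/Count
1        Problem 1  (56325+2123345+623589542)=625769212
2        Problem 2  (11233+2123345+9657531+623589542)=635381651
3        Problem 3  (546321+9657531+623589542)=633793394
Note: the figures inside the () will not be shown in the output; only the totals are needed.
I have 78 different tag keywords, and the table has about 250 rows.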
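@Maurits Evers and @akrun answered this question correctly. You need the tidyverse package for this. If you do not have tidyverse installed, run install.packages("tidyverse") in the R console. Visit the tidyverse website for details.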
Answer 0 (score: 4)
Using tstrsplit:
library(data.table)
setDT(df)[,.(Tags = unlist(tstrsplit(Tags, " & ", fixed = TRUE)), # Split by &
Freq = Frequency_Count) # Take the Frequency_Count too
][!is.na(Tags), # ignore non-matches
.(Freq_Count = sum(Freq)), # sum frequencies
by = Tags] # by the split tags
# Tags Freq_Count
# 1: Problem 1 625769212
# 2: Problem 2 635381651
# 3: Problem 3 633793394
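Answer 1 (score: 3)
Note that in your example you seem to ignore the entries in column S/n no. when summing the counts; you did not give any details, so I will ignore the entries in this column.
We can use strsplit to separate the entries, then unnest, group by Tags, and summarise Frequency/Count.
library(tidyverse)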
df %>%
mutate_if(is.factor, as.character) %>%
select(-SN_No) %>%
mutate(Tags = strsplit(Tags, " & ")) %>%
unnest(Tags) %>% # name the list column explicitly; a bare unnest() is deprecated in tidyr >= 1.0
group_by(Tags) %>%
summarise(Freq_Count = sum(Frequency_Count))
## A tibble: 3 x 2
# Tags Freq_Count
# <chr> <int>
#1 Problem 1 625769212
#2 Problem 2 635381651
#3 Problem 3 633793394
Sample data:
df <- read.table(text =
"'SN_No' Tags 'Frequency_Count'
1 'Problem 1' 56325
2 'Problem 2' 11233
3 'Problem 3' 546321
4 'Problem 1 & Problem 2' 2123345
5 'Problem 2 & Problem 3' 9657531
6 'Problem 1 & Problem 2 & Problem 3' 623589542", header = T)
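For readers who cannot install tidyverse, the same split, expand, group, and sum steps can be sketched in base R. This is only an illustration under assumptions: the names parts, long, and result are made up here, and the data frame is rebuilt inline (with numeric rather than integer counts) so the snippet runs on its own.

```r
# Sample data rebuilt inline so the sketch is self-contained
df <- data.frame(
  SN_No = 1:6,
  Tags = c("Problem 1", "Problem 2", "Problem 3",
           "Problem 1 & Problem 2", "Problem 2 & Problem 3",
           "Problem 1 & Problem 2 & Problem 3"),
  Frequency_Count = c(56325, 11233, 546321, 2123345, 9657531, 623589542),
  stringsAsFactors = FALSE
)

# Split each Tags entry on " & " into its individual tags
parts <- strsplit(df$Tags, " & ", fixed = TRUE)

# One row per individual tag, repeating the count once per tag in the entry
long <- data.frame(
  Tags = unlist(parts),
  Frequency_Count = rep(df$Frequency_Count, lengths(parts)),
  stringsAsFactors = FALSE
)

# Sum the counts for each tag
result <- aggregate(Frequency_Count ~ Tags, data = long, FUN = sum)
print(result)
```

With the sample data this should give the same three totals as the tidyverse pipeline above.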
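Answer 2 (score: 3)
Assuming you mean that the frequency 2123345 belongs to both Problem 1 and Problem 2 (and likewise for the other combined rows), I read your data in as follows and used the aggregate function to get what I think you want: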
table1 <- read.table(text = '
Tags FrequencyCount
Problem1 56325
Problem2 11233
Problem3 546321
Problem1 2123345
Problem2 2123345
Problem2 9657531
Problem3 9657531
Problem1 623589542
Problem2 623589542
Problem3 623589542',
header = TRUE)
aggregate(FrequencyCount ~ Tags, table1, sum)
#       Tags FrequencyCount
# 1 Problem1      625769212
# 2 Problem2      635381651
# 3 Problem3      633793394
If you need to fill in missing values, as in the example below, you can first run this to copy the previous value down:
table1 <- read.table(text = '
Tags FrequencyCount
Problem1 56325
Problem2 11233
Problem3 546321
Problem1 2123345
Problem2 NA
Problem2 9657531
Problem3 NA
Problem1 623589542
Problem2 NA
Problem3 NA',
header = TRUE)
library(data.table)
while(sum(is.na(table1$FrequencyCount)) > 0){
table1$FrequencyCount <- ifelse(is.na(table1$FrequencyCount),
shift(table1$FrequencyCount), table1$FrequencyCount)
}
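One caveat with the loop above: shift() always returns NA in the first position, so if the very first FrequencyCount were missing the while condition would never become false. As a hedged alternative, here is a minimal base R sketch of the same "copy the previous value down" idea that runs in a single pass. The helper name fill_forward is hypothetical (it is not part of data.table or base R), and the data is rebuilt inline so the snippet is self-contained.

```r
# Hypothetical helper (not from any package): carry the last non-NA value forward
fill_forward <- function(x) {
  # For each element, the index of the most recent non-NA position (0 if none yet)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  # Leading NAs have nothing to copy from, so they stay NA
  idx[idx == 0L] <- NA_integer_
  x[idx]
}

# Same shape as the example above, with NA where the count should be copied down
table1 <- data.frame(
  Tags = c("Problem1", "Problem2", "Problem3", "Problem1", "Problem2",
           "Problem2", "Problem3", "Problem1", "Problem2", "Problem3"),
  FrequencyCount = c(56325, 11233, 546321, 2123345, NA,
                     9657531, NA, 623589542, NA, NA)
)

table1$FrequencyCount <- fill_forward(table1$FrequencyCount)
filled <- aggregate(FrequencyCount ~ Tags, table1, sum)
print(filled)
```

After filling, the aggregate call is the same one used earlier in this answer.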
Answer 3 (score: 3)
Here is an option with separate_rows:
library(tidyverse)
df1 %>%
separate_rows(Tags, sep = "\\s+&\\s+") %>%
group_by(Tags) %>%
summarise(SN_No = first(SN_No), Frequency_Count = sum(Frequency_Count)) %>%
select(names(df1))
# A tibble: 3 x 3
# SN_No Tags Frequency_Count
# <int> <chr> <int>
#1 1 Problem 1 625769212
#2 2 Problem 2 635381651
#3 3 Problem 3 633793394
df1 <- structure(list(SN_No = 1:6, Tags = structure(c(1L, 4L, 6L, 2L,
5L, 3L), .Label = c("Problem 1", "Problem 1 & Problem 2",
"Problem 1 & Problem 2 & Problem 3",
"Problem 2", "Problem 2 & Problem 3", "Problem 3"), class = "factor"),
Frequency_Count = c(56325L, 11233L, 546321L, 2123345L, 9657531L,
623589542L)), .Names = c("SN_No", "Tags", "Frequency_Count"
), class = "data.frame", row.names = c(NA, -6L))