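I need some help finding a good way to dynamically add columns containing the counts of different categories, which I need to extract from a string.
In my data I have a column that contains category names and counts. These fields can be empty, or they can contain any combination of categories one can think of. Here are some examples: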
themes:firstcategory_1;secondcategory_33;thirdcategory_5
themes:secondcategory_33;fourthcategory_2
themes:fifthcategory_1
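What I need is one column per category (named after the category) holding the count extracted from the strings above. The list of categories is dynamic, so I do not know in advance which categories exist.
How should I approach this?
Answer 0 (score: 0)
This code will give you one column per category, containing the count for each row.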
library(dplyr)
library(tidyr)
library(stringr)
# Create test dataframe
df <- data.frame(themes = c("firstcategory_1;secondcategory_33;thirdcategory_5", "secondcategory_33;fourthcategory_2","fifthcategory_1"), stringsAsFactors = FALSE)
# Get the number of columns to split values into
cols <- max(str_count(df$themes,";")) + 1
# Get vector of temporary column names
cols <- paste0("col",c(1:cols))
df <- df %>%
  # Add an ID column based on row number
  mutate(ID = row_number()) %>%
  # Separate multiple categories by semicolon
  separate(col = themes, into = cols, sep = ";", fill = "right") %>%
  # Gather categories into a single column
  # (gather_() is deprecated; gather() with all_of() selects the same columns)
  gather(key = "Column", value = "Value", all_of(cols)) %>%
  # Drop temporary column
  select(-Column) %>%
  # Filter out NA values
  filter(!is.na(Value)) %>%
  # Separate categories from their counts by underscore;
  # convert = TRUE stores the counts as integers instead of strings
  separate(col = Value, into = c("Category", "Count"), sep = "_", fill = "right", convert = TRUE) %>%
  # Spread categories to create a column for each category, with the count for each ID in that category
  spread(Category, Count)
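If a newer tidyr is available (1.0.0 or later is assumed here), the same reshaping can probably be written more compactly. This is only a sketch, not part of the original answer: `separate_rows()` splits on the semicolon straight into rows, so the temporary `col1`, `col2`, ... columns are not needed, and `pivot_wider()` stands in for `spread()`. The result name `wide` is made up for illustration, and empty `themes` values are not handled.

```r
library(dplyr)
library(tidyr)

# Same test data as in the answer above
df <- data.frame(
  themes = c(
    "firstcategory_1;secondcategory_33;thirdcategory_5",
    "secondcategory_33;fourthcategory_2",
    "fifthcategory_1"
  ),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # Keep track of which original row each category came from
  mutate(ID = row_number()) %>%
  # One row per "category_count" entry
  separate_rows(themes, sep = ";") %>%
  # Split each entry into name and count; convert = TRUE makes Count an integer
  separate(themes, into = c("Category", "Count"), sep = "_", convert = TRUE) %>%
  # One column per category, NA where a row does not have that category
  pivot_wider(names_from = Category, values_from = Count)

print(wide)
```

Like the answer's version, this assumes category names contain no underscore of their own; if they can, the split on `"_"` would need a pattern that matches only the last underscore.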