In R

Time: 2017-06-20 13:07:31

Tags: r

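I need help finding a good way to dynamically add columns that hold the counts of different categories, which I have to extract from a string.

In my data I have a column that contains category names together with their counts. A field can be empty, or it can contain any combination of categories one can think of. Here are some examples: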

themes:firstcategory_1;secondcategory_33;thirdcategory_5
themes:secondcategory_33;fourthcategory_2
themes:fifthcategory_1

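What I need is one column per category (named after the category) holding the count extracted from the strings above. The list of categories is dynamic, so I do not know in advance which categories exist.

How should I approach this?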

1 Answer:

Answer 0: (score: 0)

This code produces one column per category, containing the count for each row.

library(dplyr)
library(tidyr)
library(stringr)

# Create test dataframe
df <- data.frame(themes = c("firstcategory_1;secondcategory_33;thirdcategory_5", "secondcategory_33;fourthcategory_2","fifthcategory_1"), stringsAsFactors = FALSE)

# Get the number of columns to split values into
cols <- max(str_count(df$themes,";")) + 1

# Get vector of temporary column names
cols <- paste0("col",c(1:cols))

df <- df %>%
      # Add an ID column based on row number
      mutate(ID = row_number()) %>%
      # Separate multiple categories by semicolon
      separate(col = themes, into = cols, sep = ";", fill = "right") %>%
      # Gather categories into a single column
      gather("Column", "Value", all_of(cols)) %>%
      # Drop temporary column
      select(-Column) %>%
      # Filter out NA values
      filter(!is.na(Value)) %>%
      # Separate categories from their counts by underscore, converting counts to integers
      separate(col = Value, into = c("Category","Count"), sep = "_", fill = "right", convert = TRUE) %>%
      # Spread categories to create a column for each category, with the count for each ID in that category
      spread(Category, Count)
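
For comparison, here is a shorter sketch of the same reshaping that avoids the temporary columns. It is not part of the original answer and rests on several assumptions: tidyr 1.0.0 or later (for pivot_wider), a literal "themes:" prefix on each field as in the question's examples, and category names that contain no underscore. The result name wide is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Test data modelled on the question, including the "themes:" prefix
# and one empty field (both are assumptions about the real data)
df <- data.frame(
  themes = c("themes:firstcategory_1;secondcategory_33;thirdcategory_5",
             "themes:secondcategory_33;fourthcategory_2",
             "themes:fifthcategory_1",
             ""),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # Keep track of the original row and strip the assumed prefix
  mutate(ID = row_number(),
         themes = sub("^themes:", "", themes)) %>%
  # One row per "category_count" entry
  separate_rows(themes, sep = ";") %>%
  # Drop entries that came from empty fields
  filter(themes != "") %>%
  # Split name and count; convert = TRUE turns the count into an integer
  separate(themes, into = c("Category", "Count"), sep = "_", convert = TRUE) %>%
  # One column per category that actually occurs in the data
  pivot_wider(names_from = Category, values_from = Count) %>%
  # Restore rows whose field was empty (all category columns NA)
  complete(ID = seq_len(nrow(df)))

print(wide)
```

Categories missing from a row come out as NA. If zeros are preferred, passing values_fill = 0 to pivot_wider should cover rows that have at least one category, but rows restored by complete() would still need separate handling.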