Create dummy variables in R from multiple chr values in each cell

Asked: 2016-11-18 12:02:23

Tags: r string character dplyr dummy-variable

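I am trying to create dummy variables based on a column named 'Tags' in my df (14 rows, 2 columns: Score and Tags). My problem is that each cell can contain any number of chr values (up to about 30).

When I run: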

 str(df$Tags)

R returns:

chr [1:14] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolera"| __truncated__ ...

When I run:

df$Tags[1]

R returns:

[1] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolerantie\", \"met familie\", \"met vrienden\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", chinees, gastronomisch, glutenvrij, kindvriendelijk, romantisch, traditioneel, trendy, verjaardag, zakelijk" 

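It seems the values in the first cell are not formatted consistently: some of the comma-separated values are wrapped in quotation marks and others are not.

What I want is a dummy variable for every possible value that occurs in any cell. The first new dummy should therefore be called "biologische gerechten" (or something similar) and should show, for each row, whether that value is present in the 'Tags' column (1) or not (0).

I tried several things with 'dplyr', such as: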

df = mutate(df, biologisch = ifelse(Tags == "biologische gerechten", 1, 0))

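R creates a new column 'biologisch', but it contains only zeros. Is there another way to separate all the values and then create dummy variables for every possible value? I hope someone can help me, thanks!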
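The column is all zeros because `==` compares each whole cell with the string, and no cell consists of that single tag alone. A minimal sketch of the difference follows, using a hypothetical two-row data frame `toy` (not the asker's data) and base R's `grepl` with `fixed = TRUE` for a literal substring test:

```r
# hypothetical two-row example; the real df has 14 rows
toy <- data.frame(
  Tags = c("\"biologische gerechten\", chinees, trendy",
           "chinees, zakelijk"),
  stringsAsFactors = FALSE
)

# `==` compares the whole string, so neither row matches
whole_cell <- ifelse(toy$Tags == "biologische gerechten", 1, 0)

# a literal substring test finds the tag inside the longer string
substring_hit <- as.integer(grepl("biologische gerechten", toy$Tags, fixed = TRUE))

print(whole_cell)
print(substring_hit)
```

A substring test like this can still give false positives when one tag is contained in another, so splitting the cells into separate tags is the more robust route.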

1 Answer:

Answer 0: (score: 3)

Here is a solution:

# make some toy data to test
set.seed(1)
df <- data.frame(Score = rnorm(10),
                 Tags = replicate(10, paste(sample(LETTERS, 5), collapse = ", ")),
                 stringsAsFactors = FALSE)

# load stringr, which we'll use to trim whitespace from the split-up tags
library(stringr)

# use strsplit to break your jumbles of tags into separate elements; the result
# is a list with one character vector per element of the original vector. i've
# split on commas here, but for your data you'll probably also need to strip the
# quotation marks (the backslashes in the printed output are only escapes).
t <- strsplit(df$Tags, split = ",")

# get a vector of the unique elements of those lists. you may need to use str_trim
# or something like it to cut leading and trailing whitespace. you might also
# need to use stringr's `str_subset` and a regular expression to cut the result
# down to, say, only alphanumeric strings. without a reproducible example, though,
# i can't do that for your specific case here.
tags <- unique(str_trim(unlist(t)))

# now, use `sapply` and `grepl` to look for each element of `tags` in each list;
# use `any` to summarize the results;
# use `+` to convert those summaries to binary;
# use `lapply` to iterate that process over all elements of `tags`;
# use `Reduce(cbind, ...)` to collapse the results into a table; and
# use `as.data.frame` to turn that table into a df.
# note: `grepl` matches substrings, so a tag contained in a longer tag will also
# match the longer one; `fixed = TRUE` treats each tag as literal text, not a regex.
df2 <- as.data.frame(Reduce(cbind, lapply(tags, function(i) {
  sapply(t, function(j) +(any(grepl(i, j, fixed = TRUE), na.rm = TRUE)))
})))

# assign the tags as column names
names(df2) <- tags

Voilà:

> df2
   Y F P C Z K A J U H M O L E S R T Q V B I X G
1  1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2  0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3  0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
4  0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
5  0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0
6  0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
7  0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
8  0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
9  1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 1
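For tags in the mixed quoted/unquoted format shown in the question, the same split-then-flag idea might be sketched as follows. This is only an illustration under assumptions: the three-row `df` is invented to mimic that format, the names `tag_list`, `all_tags`, `dummies` and `result` are hypothetical, and it swaps in base R's `trimws` and an exact `%in%` match in place of `str_trim` and `grepl`.

```r
# hypothetical data mimicking the mixed quoted/unquoted format in the question
df <- data.frame(
  Score = c(8.5, 7.0, 9.1),
  Tags = c("\"biologische gerechten\", \"met familie\", chinees, trendy",
           "chinees, zakelijk",
           "\"met familie\", glutenvrij"),
  stringsAsFactors = FALSE
)

# drop the literal quotation marks, then split each cell on commas
tag_list <- strsplit(gsub("\"", "", df$Tags, fixed = TRUE), ",", fixed = TRUE)

# trim leading and trailing whitespace from every tag
tag_list <- lapply(tag_list, trimws)

# all distinct tags, in order of first appearance
all_tags <- unique(unlist(tag_list))

# one row per original row, one column per tag; %in% is an exact match,
# so a tag never matches merely because it is part of a longer tag
dummies <- do.call(rbind, lapply(tag_list, function(x) as.integer(all_tags %in% x)))
colnames(dummies) <- all_tags

# attach the dummy columns to the original data
result <- cbind(df, as.data.frame(dummies))
print(result)
```

Matching whole tags with `%in%` avoids the substring caveat of `grepl`, at the cost of requiring the tags to be cleaned consistently (quotes removed, whitespace trimmed) before comparison.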