Question

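I need some help setting up code in R to solve the following problem:

I want to give R some string data as input, where each string contains several words (phrases, tweets, whatever you like). The strings can also have multiple " " or "," characters as separators.

Sample input data

I would like R to create one variable for each unique word across all input strings, and set it to 1 (or TRUE, or anything else) whenever a string contains that particular word.

So my desired output would look something like this:

Sample output

The blanks in the columns should contain 0; I left them out for readability.

To be honest, I am no expert at loops, and I suspect that using a package might be simpler. Thanks for your support on this topic, as I have several different projects where such a solution would save me a lot of time.

Edit: I would like to keep the original ID and string for further processing.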

Answer 1

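First off, for future posts please provide sample data in a reproducible, copy-and-paste-able format. Screenshots are not a good idea because we cannot easily extract data from an image. For more details, please review how to provide a minimal reproducible example/attempt.

That aside, here is a tidyverse solution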

library(tidyverse)
df %>%
    separate_rows(Text, sep = " ") %>%
    mutate(n = 1) %>%
    pivot_wider(names_from = "Text", values_from = "n", values_fill = list(n = 0))
## A tibble: 5 x 6
#  ID      Peanut Butter Jelly Storm  Wind
#  <fct>    <dbl>  <dbl> <dbl> <dbl> <dbl>
#1 ID-0001      1      1     1     0     0
#2 ID-0002      1      0     0     0     0
#3 ID-0003      0      1     0     0     0
#4 ID-0004      0      0     0     1     0
#5 ID-0005      0      1     0     1     1

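Explanation: We use separate_rows to split the entries in Text on whitespace, which reshapes the data into long format; we then add a count column; finally, we use pivot_wider to reshape the data from long to wide, filling missing values with 0.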
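The edit to the question asks for the original string to be kept, and the question also mentions repeated spaces or commas as separators. A minimal sketch of one way to adapt the same pipeline is shown here; the helper column name `word`, the separator pattern `[ ,]+`, and the extra `filter`/`distinct` steps are assumptions added for illustration, not part of the solution above.

```r
library(dplyr)
library(tidyr)

# Same sample data as in this answer, rebuilt here so the block is self-contained
df <- data.frame(
    ID = c("ID-0001", "ID-0002", "ID-0003", "ID-0004", "ID-0005"),
    Text = c("Peanut Butter Jelly", "Peanut", "Butter", "Storm", "Storm Wind Butter"),
    stringsAsFactors = FALSE)

res <- df %>%
    mutate(word = Text) %>%                   # split a copy, so Text itself survives
    separate_rows(word, sep = "[ ,]+") %>%    # one row per word; runs of spaces/commas count as one separator
    filter(word != "") %>%                    # drop empty pieces from leading/trailing separators
    distinct(ID, Text, word) %>%              # a word repeated within one string is counted once
    mutate(n = 1) %>%
    pivot_wider(names_from = word, values_from = n, values_fill = list(n = 0))

res
```

Because ID and Text are both left out of names_from and values_from, pivot_wider treats them as identifying columns and carries them through to the wide result.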

Or in base R using xtabs

df2 <- transform(df, Text = strsplit(as.character(Text), " "))
xtabs(n ~ ., data.frame(
    ID = with(df2, rep(ID, vapply(Text, length, 1L))),
    Text = unlist(df2$Text),
    n = 1))
#ID        Butter Jelly Peanut Storm Wind
#  ID-0001      1     1      1     0    0
#  ID-0002      0     0      1     0    0
#  ID-0003      1     0      0     0    0
#  ID-0004      0     0      0     1    0
#  ID-0005      1     0      0     1    1

Sample data

df <- read.table(text =
"ID Text
ID-0001   'Peanut Butter Jelly'
ID-0002   Peanut
ID-0003   Butter
ID-0004   Storm
ID-0005   'Storm Wind Butter'", header = TRUE)

Answer 2

In base R, a two-step solution could look like this:

# Extract all words, keep only unique words, sort in alphabetic order:
all_words <- sort(unique(unlist(strsplit(df$strings, "\\W"))))

# Fill columns with 1 or 0 depending on whether the word is present in each string
cbind(df, sapply(all_words, function(x) 1 * grepl(x, df$strings)))
#>       ID             strings Butter Jelly Peanut Storm Wind
#> 1 ID0001 Peanut Butter Jelly      1     1      1     0    0
#> 2 ID0002              Peanut      0     0      1     0    0
#> 3 ID0003              Butter      1     0      0     0    0
#> 4 ID0004               Storm      0     0      0     1    0
#> 5 ID0005   Storm Wind Butter      1     0      0     1    1
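Two caveats may matter for messier input: splitting on a single "\\W" produces empty strings when separators are doubled (for example ", "), and a plain grepl also matches inside longer words. The following is a hedged sketch of a variant that addresses both; the small data frame is made up for illustration, and the use of "\\W+", nzchar, and "\\b" word boundaries is an assumption about what is wanted, not part of the original answer.

```r
# Hypothetical input with a comma, a double space, and a longer word containing "Butter"
df <- data.frame(
    ID = c("ID0001", "ID0002", "ID0003"),
    strings = c("Peanut, Butter  Jelly", "Peanut", "Butterfly"),
    stringsAsFactors = FALSE)

# Split on runs of non-word characters, then drop any empty pieces
all_words <- sort(unique(unlist(strsplit(df$strings, "\\W+"))))
all_words <- all_words[nzchar(all_words)]

# \\b anchors the pattern at word boundaries, so "Butter" does not match "Butterfly"
flags <- sapply(all_words, function(w) {
    1 * grepl(paste0("\\b", w, "\\b"), df$strings, perl = TRUE)
})

res <- cbind(df, flags)
res
```

Since the words come from splitting on non-word characters, they contain no regex metacharacters and can be pasted into the pattern directly.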

Data used:

df <- structure(list(ID = c("ID0001", "ID0002", "ID0003", "ID0004", 
      "ID0005"), strings = c("Peanut Butter Jelly", "Peanut", "Butter", 
      "Storm", "Storm Wind Butter")), class = "data.frame", row.names = c(NA, -5L))

Created on 2020-02-25 by the reprex package (v0.3.0)

R: Split string into distinct variables and assign 1 if the string contains that word
