Opposite of unnest_tokens after creating a dummy variable

Time: 2018-02-20 18:27:13

Tags: r tidytext

library(NLP)
library(tm)
library(tidytext)
library(tidyverse)
library(topicmodels)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
#sample dataset
tags <- c("product, productdesign, electronicdevice")
web <- c("hardware, sunglasses, eyeware")
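# tibble() never converts strings to factors, so stringsAsFactors is not needed
# (passing it would just create an extra column with that name)
tags2 <- tibble(tags, web)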
#tokenize the words
toke <- tags2 %>%
  unnest_tokens(word, tags)
toke
#create a dummy variable
toke2 <- toke%>% mutate(
  product = ifelse(str_detect(word, "^product$"), "1", "0"))
#nest the tokens back into one text string
nested_toke <- toke2 %>%
  nest(word) %>%
  mutate(text = map(data, unlist), 
         text = map_chr(text, paste, collapse = " "))

nested_toke %>%
  select(text)

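When I nest the tokenized word column after creating a dummy variable based on the string "product", it seems to put "product" into a new row below the original row where "product" was found.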

product underlined should be in the row above

1 answer:

Answer 0 (score: 0)

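When you add a new column after unnesting, you have to think about how to handle it if you want to nest again. Let's walk through it and see what we're talking about.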

library(tidyverse)
tags <- c("product, productdesign, electronicdevice")
web <- c("hardware, sunglasses, eyeware")
tags2 <- tibble(tags, web)  # tibble() replaces the deprecated data_frame()

library(tidytext)
tidy_tags <- tags2 %>%
    unnest_tokens(word, tags)
tidy_tags
#> # A tibble: 3 x 2
#>   web                           word            
#>   <chr>                         <chr>           
#> 1 hardware, sunglasses, eyeware product         
#> 2 hardware, sunglasses, eyeware productdesign   
#> 3 hardware, sunglasses, eyeware electronicdevice

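With that, your dataset is unnested and converted to a tidy form, one token per row. Next, let's add the new column, detecting whether the word "product" is in the word column.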

tidy_product <- tidy_tags %>% 
    mutate(product = str_detect(word, "^product$"))
tidy_product
#> # A tibble: 3 x 3
#>   web                           word             product
#>   <chr>                         <chr>            <lgl>  
#> 1 hardware, sunglasses, eyeware product          T      
#> 2 hardware, sunglasses, eyeware productdesign    F      
#> 3 hardware, sunglasses, eyeware electronicdevice F

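Now think about what your options are. If you nest again without accounting for the new column (nest(word)), the structure still has the new column, and new rows have to be created to account for the two different values it can take. You could instead do something like nest(word, product), but then the TRUE/FALSE values would end up in your text string. If you want to get back to the original text format, you need to remove the new column you created, because it changed the relationship between the rows and columns.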
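To see where the extra row comes from, it can help to count the distinct combinations of the columns that are not being nested, since nesting keeps one row per such combination. This is only a small sketch: the data is rebuilt by hand so the block runs on its own, and the names combos and combos_without_flag are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hand-built copy of the tokenized data plus the dummy column
tidy_product <- tibble(
  web  = "hardware, sunglasses, eyeware",
  word = c("product", "productdesign", "electronicdevice")
) %>%
  mutate(product = str_detect(word, "^product$"))

# With the dummy column present, the non-nested columns take two
# distinct combinations (product TRUE and product FALSE)
combos <- tidy_product %>%
  distinct(web, product)
combos

# Once the dummy column is dropped, only one combination is left
combos_without_flag <- tidy_product %>%
  select(-product) %>%
  distinct(web)
combos_without_flag
```

Two combinations mean two rows after nesting, which is the split described in the question. Dropping the column brings it back to a single row.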

nested_product <- tidy_product %>%
    select(-product) %>%
    nest(word) %>%
    mutate(text = map(data, unlist), 
           text = map_chr(text, paste, collapse = ", "))

nested_product
#> # A tibble: 1 x 3
#>   web                           data             text                     
#>   <chr>                         <list>           <chr>                    
#> 1 hardware, sunglasses, eyeware <tibble [3 × 1]> product, productdesign, …

Created on 2018-02-22 by the reprex package (v0.2.0).
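If the goal is to keep the information from the dummy variable while still getting the text back on one row, one possible alternative (not part of the answer above) is to collapse with group_by() and summarise() and reduce the flag to a single document-level value. This is a sketch under that assumption: the column name has_product and the object collapsed are hypothetical, and the data is rebuilt by hand so the block is self-contained.

```r
library(dplyr)
library(stringr)

# Hand-built copy of the tokenized data plus the dummy column
tidy_product <- tibble(
  web  = "hardware, sunglasses, eyeware",
  word = c("product", "productdesign", "electronicdevice")
) %>%
  mutate(product = str_detect(word, "^product$"))

# Collapse back to one row per document:
# paste the tokens together and keep one flag for the whole document
collapsed <- tidy_product %>%
  group_by(web) %>%
  summarise(
    text = paste(word, collapse = ", "),
    has_product = any(product)   # TRUE if any token is exactly "product"
  )
collapsed
```

Because the flag is aggregated rather than kept per token, it no longer forces a second row, and the text string stays free of TRUE/FALSE values.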