Opposite of unnest_tokens

Time: 2017-10-13 16:44:38

Tags: r tidyr tidyverse tidytext

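This is most likely a stupid question, but I have googled and googled and can't find a solution. I think it's because I don't know the right way to word my search.

I have a data frame that I converted to tidy text format in R to get rid of stop words. I would now like to 'untidy' that data frame back to its original format.

What is the opposite / inverse command of unnest_tokens?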

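Edit: here is what the data I'm working with looks like. I'm trying to replicate the analyses from Silge and Robinson's Tidy Text book, but using Italian opera librettos.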

character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
sample_df = data.frame(character, line)
sample_df

character line
FIGARO    Cinque... dieci.... venti... trenta... trentasei...quarantatre
SUSANNA   Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.
CONTE     Susanna, mi sembri agitata e confusa.
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!

I turn it into tidy text so that I can get rid of the stop words:

library(dplyr)
library(tidytext)

# One row per word; the character column is kept
tribble <- sample_df %>%
  unnest_tokens(word, line)
# Get rid of stop words
# I had to make my own list of stop words for 18th century Italian opera
# (mystopwords, a character vector of those words, is not shown here)
itstopwords <- tibble(word = mystopwords)
tribble2 <- tribble %>%
  anti_join(itstopwords, by = "word")

Now I have something like this:

text    word
FIGARO  cinque
FIGARO  dieci
FIGARO  venti
FIGARO  trenta
...

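I'd like to get it back into the format of character name and associated line so I can look at other things. Basically, I want the text in the same format as before, but with the stop words removed.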

2 Answers:

Answer 0: (score: 9)

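Not a stupid question! The answer depends a bit on exactly what you are trying to do, but this would be my typical approach if I wanted to get my text back to its original form after doing some processing in its tidied form, using the map functions from purrr.

First, let's go from raw text to a tidied format.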

library(tidyverse)
library(tidytext)


tidy_austen <- janeaustenr::austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

tidy_austen
#> # A tibble: 725,055 x 3
#>                   book linenumber        word
#>                 <fctr>      <int>       <chr>
#>  1 Sense & Sensibility          1       sense
#>  2 Sense & Sensibility          1         and
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3          by
#>  5 Sense & Sensibility          3        jane
#>  6 Sense & Sensibility          3      austen
#>  7 Sense & Sensibility          5        1811
#>  8 Sense & Sensibility         10     chapter
#>  9 Sense & Sensibility         10           1
#> 10 Sense & Sensibility         13         the
#> # ... with 725,045 more rows

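Now the text is tidy! But we can untidy it, back to something sort of like its original form. I typically approach this using nest from tidyr, and then some map functions from purrr.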

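# nest(word) is the pre-1.0 tidyr syntax; with tidyr >= 1.0.0 write nest(data = word)
nested_austen <- tidy_austen %>%
  nest(word) %>%                                  # one list-column entry per book/line
  mutate(text = map(data, unlist),                # tibble of words -> character vector
         text = map_chr(text, paste, collapse = " "))  # vector -> single string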

nested_austen
#> # A tibble: 62,272 x 4
#>                   book linenumber              data
#>                 <fctr>      <int>            <list>
#>  1 Sense & Sensibility          1  <tibble [3 x 1]>
#>  2 Sense & Sensibility          3  <tibble [3 x 1]>
#>  3 Sense & Sensibility          5  <tibble [1 x 1]>
#>  4 Sense & Sensibility         10  <tibble [2 x 1]>
#>  5 Sense & Sensibility         13 <tibble [12 x 1]>
#>  6 Sense & Sensibility         14 <tibble [13 x 1]>
#>  7 Sense & Sensibility         15 <tibble [11 x 1]>
#>  8 Sense & Sensibility         16 <tibble [12 x 1]>
#>  9 Sense & Sensibility         17 <tibble [11 x 1]>
#> 10 Sense & Sensibility         18 <tibble [15 x 1]>
#> # ... with 62,262 more rows, and 1 more variables: text <chr>

What does the text look like at the end, in this particular case?

nested_austen %>%
  select(text)
#> # A tibble: 62,272 x 1
#>                                                                   text
#>                                                                  <chr>
#>  1                                               sense and sensibility
#>  2                                                      by jane austen
#>  3                                                                1811
#>  4                                                           chapter 1
#>  5 the family of dashwood had long been settled in sussex their estate
#>  6  was large and their residence was at norland park in the centre of
#>  7      their property where for many generations they had lived in so
#>  8 respectable a manner as to engage the general good opinion of their
#>  9 surrounding acquaintance the late owner of this estate was a single
#> 10  man who lived to a very advanced age and who for many years of his
#> # ... with 62,262 more rows
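
Note that the rebuilt text is lowercase and has lost its punctuation, because unnest_tokens lowercases and strips punctuation by default. Below is a minimal sketch of the same nest-and-map idea that keeps capitalization by passing to_lower = FALSE. It assumes tidyr >= 1.0.0 (for the nest(data = word) syntax), and the two-row toy tibble is made up for illustration; it is not the Austen data.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)

# Made-up two-line example (an assumption for illustration only)
toy <- tibble(
  linenumber = 1:2,
  text = c("Sense and Sensibility", "by Jane Austen")
)

renested <- toy %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%  # keep original capitalization
  nest(data = word) %>%                            # tidyr >= 1.0.0 naming syntax
  mutate(text = map_chr(data, ~ paste(.x$word, collapse = " "))) %>%
  select(-data)                                    # drop the list-column

renested
```

Punctuation is still dropped here, so this is closer to the original text but not an exact round trip.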

Answer 1: (score: 7)

library(tidyverse)
tidy_austen %>% 
     group_by(book,linenumber) %>% 
     summarise(text = str_c(word, collapse = " "))
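
Applied to data shaped like the question's, the same group-and-collapse idea might look like the sketch below. The stop word list here is a hypothetical stand-in (the asker's real mystopwords is not shown in the post), and line_id is an added helper column, playing the same role as linenumber in the answer above, so that several lines by the same character stay separate.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two lines taken from the question's sample data (the first one shortened)
sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti... trenta...",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stop word list, for illustration only
itstopwords <- tibble(word = c("mi", "e"))

untidied <- sample_df %>%
  mutate(line_id = row_number()) %>%          # remember which line each word came from
  unnest_tokens(word, line) %>%               # one row per word
  anti_join(itstopwords, by = "word") %>%     # drop stop words
  group_by(line_id, character) %>%
  summarise(line = str_c(word, collapse = " ")) %>%  # paste words back together
  ungroup()

untidied
```

One caveat with this approach: a line made up entirely of stop words has no rows left after the anti_join, so it drops out of the result altogether.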