Question

I have a data frame that looks like this:

sentences <- data.frame(sentences = 
                          c('You can apply for or renew your Medical Assistance benefits online by using COMPASS.',
                            'COMPASS is the name of the website where you can apply for Medical Assistance and many other services that can help you make ends meet.',
                          'Medical tourism refers to people traveling to a country other than their own to obtain medical treatment. In the past this usually referred to those who traveled from less-developed countries to major medical centers in highly developed countries for treatment unavailable at home.',
                          'Health tourism is a wider term for travel that focus on medical treatments and the use of healthcare services. It covers a wide field of health-oriented, tourism ranging from preventive and health-conductive treatment to rehabilitational and curative forms of travel.',
                          'Medical tourism carries some risks that locally provided medical care either does not carry or carries to a much lesser degree.',
                          'Receiving medical care abroad may subject medical tourists to unfamiliar legal issues. The limited nature of litigation in various countries is a reason for accessbility of care overseas.', 
                          'While some countries currently presenting themselves as attractive medical tourism destinations provide some form of legal remedies for medical malpractice, these legal avenues may be unappealing to the medical tourist.'))

What I want to do is find the important words in each row and create a new column, which should look like this:

sentences$ImpWords <- c("apply, renew, Medical, Assistance, benefits, online, COMPASS",
                    "COMPASS, name, website, apply, Medical, Assistance, services, help, meet") 

and so forth

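I'm not sure how to do this.

I have tried various packages such as tm and tidytext for tokenizing, cleaning, and preprocessing the words, but I could not get the desired result.

Are there any other options?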

Answer 1

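This will get you what you're after. If you want to remove more words, just find a larger or different stop word list (many are available through different packages). Here I used tm's English stop words.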

library(tm)
library(magrittr)  # provides the %>% pipe used below
stopwords <- stopwords('en')

sentences <- data.frame(sentences = 
                          c('You can apply for or renew your Medical Assistance benefits online by using COMPASS.',
                            'COMPASS is the name of the website where you can apply for Medical Assistance and many other services that can help you make ends meet.',
                            'Medical tourism refers to people traveling to a country other than their own to obtain medical treatment. In the past this usually referred to those who traveled from less-developed countries to major medical centers in highly developed countries for treatment unavailable at home.',
                            'Health tourism is a wider term for travel that focus on medical treatments and the use of healthcare services. It covers a wide field of health-oriented, tourism ranging from preventive and health-conductive treatment to rehabilitational and curative forms of travel.',
                            'Medical tourism carries some risks that locally provided medical care either does not carry or carries to a much lesser degree.',
                            'Receiving medical care abroad may subject medical tourists to unfamiliar legal issues. The limited nature of litigation in various countries is a reason for accessbility of care overseas.', 
                            'While some countries currently presenting themselves as attractive medical tourism destinations provide some form of legal remedies for medical malpractice, these legal avenues may be unappealing to the medical tourist.'))


# Make sure the text column is character (it is a factor by default in R < 4.0)
sentences[,"sentences"] <- sentences[,"sentences"] %>% as.character()

ImpWords <- character(nrow(sentences))
for (i in seq_len(nrow(sentences))) {

  # Replace punctuation with spaces and split into words, keeping the original case
  originalWords <- gsub('[[:punct:] ]+',' ',sentences[i, "sentences"]) %>% trimws(.) %>% strsplit(., " ")
  # Lower-cased copy, used only for matching against the stop words
  lowerCaseWords <- gsub('[[:punct:] ]+',' ',tolower(sentences[i, "sentences"])) %>% trimws(.) %>% strsplit(., " ")
  # Drop stop words
  wordsNotInStopWords <- originalWords[[1]][!lowerCaseWords[[1]] %in% stopwords]
  # Keep only words longer than three characters
  wordsNotInStopWordsGreaterThanThreeChar <- wordsNotInStopWords[nchar(wordsNotInStopWords) > 3]
  ImpWords[i] <- paste(wordsNotInStopWordsGreaterThanThreeChar, collapse = ", ")

}

sentences$ImpWords <- ImpWords
sentences$ImpWords
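
As a side note, the same filtering can be written without an explicit loop, and the stop word list can be extended with your own words. The following is only a minimal base-R sketch: the helper name extract_imp_words, the min_chars argument, and the tiny hand-written stop list are all assumptions made for illustration. In practice you would probably pass in tm's stopwords('en') or another package's list instead.

```r
# Hypothetical helper: keep words that are not stop words and are longer
# than min_chars characters, preserving the original capitalization.
extract_imp_words <- function(text, stop_list, min_chars = 3) {
  # Collapse runs of punctuation/spaces into single spaces, then split into words
  cleaned <- trimws(gsub("[[:punct:] ]+", " ", text))
  words <- strsplit(cleaned, " ", fixed = TRUE)
  vapply(words, function(w) {
    keep <- !(tolower(w) %in% stop_list) & nchar(w) > min_chars
    paste(w[keep], collapse = ", ")
  }, character(1))
}

# Tiny illustrative stop list (an assumption, not tm's actual list)
base_stops <- c("you", "can", "for", "or", "your", "by", "is", "the", "of", "and")
# Extend the list with an extra word of your own
my_stops <- union(base_stops, "using")

txt <- "You can apply for or renew your Medical Assistance benefits online by using COMPASS."
imp <- extract_imp_words(txt, my_stops)
print(imp)
```

Because the helper is vectorized over its first argument, it could be applied to the whole column in one call, for example sentences$ImpWords <- extract_imp_words(sentences$sentences, my_stops).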

Answer 2

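If you like, here is an approach that uses tidy data principles. One nice thing about this approach is that it is very flexible about which stop word dictionary you use. You can switch dictionaries via the arguments to get_stopwords().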

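library(tidyverse)
library(tidytext)

sentences %>%
  mutate(line = row_number()) %>%                 # track which sentence each word came from
  unnest_tokens(word, sentences) %>%              # one lower-cased word per row
  anti_join(get_stopwords(source = "smart")) %>%  # drop stop words
  nest(data = word) %>%                           # tidyr >= 1.0 syntax; older versions used nest(word)
  mutate(words = map(data, unlist),
         words = map_chr(words, paste, collapse = " "))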

#> Joining, by = "word"
#> # A tibble: 7 x 3
#>    line data           words                                              
#>   <int> <list>         <chr>                                              
#> 1     1 <tibble [7 × … apply renew medical assistance benefits online com…
#> 2     2 <tibble [9 × … compass website apply medical assistance services …
#> 3     3 <tibble [23 ×… medical tourism refers people traveling country ob…
#> 4     4 <tibble [25 ×… health tourism wider term travel focus medical tre…
#> 5     5 <tibble [12 ×… medical tourism carries risks locally provided med…
#> 6     6 <tibble [18 ×… receiving medical care abroad subject medical tour…
#> 7     7 <tibble [17 ×… countries presenting attractive medical tourism de…

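Created on 2018-08-14 by the reprex package (v0.2.0).

The first line creates a column to keep track of each sentence, and the next line uses unnest_tokens() to tokenize the text and convert it to a tidy format. You can then remove the stop words with anti_join(). After that, the last two lines convert from the tidy data format (which, FYI, does have the information you are looking for, just in a different shape) to the data structure you asked about. You can drop the data column with select(-data) if you like.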
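
To make the reshaping steps described above more concrete, here is a rough base-R sketch of the same idea with no packages involved: tokenize, build a one-word-per-row table, drop stop words, and collapse back to one row per line. It is not the tidytext implementation. The two shortened example sentences and the four-word stop list are assumptions chosen only to keep the example small, and the simple punctuation stripping here would glue hyphenated words together.

```r
# Two short example sentences (shortened from the question's data)
sentences <- data.frame(
  sentences = c("COMPASS is the name of the website.",
                "Medical tourism carries some risks."),
  stringsAsFactors = FALSE
)

# Tiny illustrative stop list (an assumption, not the SMART list)
stops <- c("is", "the", "of", "some")

# 1. Tokenize: lower-case, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]+", "", tolower(sentences$sentences)), "\\s+")

# 2. Tidy (long) format: one row per word, tagged with the line it came from
tidy <- data.frame(
  line = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# 3. Remove stop words (the role anti_join() plays in the pipeline)
tidy <- tidy[!tidy$word %in% stops, ]

# 4. Collapse back to one row per line
result <- aggregate(word ~ line, data = tidy, FUN = paste, collapse = " ")
print(result)
```

One thing to watch for in this sketch: a line whose words are all stop words has no rows left after step 3, so it would be missing from the collapsed result.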

Find important words row-wise in a text data frame