Question

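I've been working on this course problem and eventually got the answer the quiz required. I'm fairly new to R, less than five weeks in, and it took me several hours to figure out. My task was to find every occurrence of the names Jurgis, Ona, and Chicago in The Jungle.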

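Problem: I wasted a lot of time using gsub to remove punctuation, only to realize later that some elements contained two words: "Jurgis read" would collapse into "Jurgisread" and would not be counted. Then there were variants such as "Jurgis" and "Jurgiss", and the same for Ona and the city of Chicago.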

Wanted: some tips on how to handle this kind of file better.

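What I did: I started with the first two lines of code, which split the elements on the spaces they came with. Then I picked the punctuation marks I wanted to remove. Once I had replaced what I assumed were all the common ones with spaces, I split the elements again. Finally, I forced all the words to uppercase and tallied the matches with table().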

theJungle <- readLines("http://www.gutenberg.org/files/140/140.txt")
theJungleList <- unlist(strsplit(theJungle[47:13872], " "))

splitJungle1 <- unlist(strsplit(theJungleList, "[[:space:]]", fixed = FALSE,
                                perl = FALSE, useBytes = FALSE))

remPunctuation <- gsub("-|'|,|:|;|\\.|\\*|\\(|\"|!|\\?", " ", splitJungle1)

splitJungle2 <- unlist(strsplit(remPunctuation, "[[:space:]]", fixed = FALSE,
                                perl = FALSE, useBytes = FALSE))

table(toupper(splitJungle2) == "JURGIS")
table(toupper(splitJungle2) == "ONA")
table(toupper(splitJungle2) == "CHICAGO")

Thanks!


Answer 1

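If this is for a class, there are probably specific techniques you are expected to use. If you are simply interested in text analysis in R, you might consider using tidy data principles and the tidytext package. In that mode of working, finding word frequencies is a pretty quick thing to do.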

library(dplyr)
library(tidytext)
library(stringr)

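theJungle <- readLines("http://www.gutenberg.org/files/140/140.txt")
# One row per line of the book, then one row per lowercased word token.
# tibble() replaces the deprecated data_frame().
jungle_df <- tibble(text = theJungle) %>%
    unnest_tokens(word, text)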

What are the most common words in the text?

jungle_df %>%
    count(word, sort = TRUE)

#> # A tibble: 10,349 × 2
#>     word     n
#>    <chr> <int>
#> 1    the  9114
#> 2    and  7350
#> 3     of  4484
#> 4     to  4270
#> 5      a  4217
#> 6     he  3312
#> 7    was  3056
#> 8     in  2570
#> 9     it  2318
#> 10   had  2234
#> # ... with 10,339 more rows

How often do the specific names you are looking for appear?

jungle_df %>%
    count(word) %>%
    filter(str_detect(word, "^jurgis|^ona|^chicago"))

#> # A tibble: 6 × 2
#>        word     n
#>       <chr> <int>
#> 1   chicago    68
#> 2 chicago's     4
#> 3    jurgis  1098
#> 4  jurgis's    19
#> 5       ona   200
#> 6     ona's    25
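
If the class expects base R rather than tidytext, the same idea can be sketched without splitting or stripping punctuation at all: match each name directly in the raw lines with a word-boundary regex. This is only a minimal sketch under some assumptions: `count_name` is a hypothetical helper, `sample_text` is a made-up stand-in for the book lines, and the pattern only handles a straight apostrophe in the possessive.

```r
# Made-up lines standing in for the book text (not quotes from The Jungle)
sample_text <- c(
  "Jurgis read the paper while Ona slept.",
  "\"Jurgis!\" cried Ona. Jurgis's work in Chicago was hard.",
  "Chicago's stockyards; Ona's family--and Jurgis."
)

# Hypothetical helper: tally case-insensitive matches of a name,
# optionally followed by a possessive 's, delimited by word boundaries.
count_name <- function(lines, name) {
  pattern <- paste0("\\b", name, "('s)?\\b")
  # gregexpr finds every match per line; regmatches extracts the matched text
  hits <- regmatches(lines,
                     gregexpr(pattern, lines, ignore.case = TRUE, perl = TRUE))
  table(tolower(unlist(hits)))
}

# One frequency table per name
counts <- lapply(c("jurgis", "ona", "chicago"), count_name, lines = sample_text)
names(counts) <- c("jurgis", "ona", "chicago")
counts
```

Because the punctuation is never removed, "Jurgis read" cannot collapse into "Jurgisread", and the word boundaries keep a short name like "ona" from matching inside a longer word. To run it on the book, pass the vector returned by readLines() in place of `sample_text`.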
