Text mining in R error: non-numeric argument to binary operator

Time: 2018-04-24 16:06:32

Tags: r ggplot2 dplyr text-mining tidyr

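I have googled the error, checked the book's current errata, and searched Stack Overflow, but I haven't found an answer. I'm working through pages 4-10 of the book.

This part runs fine: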

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()
original_books

tidy_books <- original_books %>%
unnest_tokens(word, text)
tidy_books

data(stop_words)

tidy_books<- tidy_books %>%
  anti_join(stop_words)

tidy_books %>%
  count(word, sort = TRUE)

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

hgwells <- gutenberg_download(c(35, 36, 5230, 159))

tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_hgwells %>%
count(word, sort=TRUE) 

bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))  

tidy_bronte <- bronte %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)

tidy_bronte %>%
count(word, sort=TRUE)

frequency <- bind_rows(mutate(tidy_bronte, author = "Bronte Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"),
                       mutate(tidy_books, author = "Jane Austen")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, 'Bronte Sisters':'H.G. Wells')
frequency

But when I run this code:

ggplot(frequency, aes(x=proportion, y='Jane Austen', 
                  color=abs('Jane Austen' - proportion))) +
geom_abline(color="gray40", lty=2) +
geom_jitter(alpha=0.1, size=2.5, width=0.3, height=0.3) +
geom_text(aes(label= word), check_overlap=TRUE, vjust=1.5) +
scale_x_log10(labels= percent_format()) +
scale_y_log10(labels= percent_format()) +
scale_color_gradient(limits= c(0, 0.001), 
                   low= "darkslategray4", high = "gray75") +
facet_wrap(~author, ncol=2) +
theme(legend.position="none") +
labs(y="Jane Austen", x=NULL) 

I get this error: Error in "Jane Austen" - proportion : non-numeric argument to binary operator

Here is the structure of frequency:

> str(frequency)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   57818 obs. of  4 variables:
 $ word       : chr  "a" "a'most" "a'n't" "aback" ...
 $ Jane Austen: num  9.19e-06 NA 4.60e-06 NA NA ...
 $ author     : chr  "Bronte Sisters" "Bronte Sisters" "Bronte Sisters" 
               "Bronte Sisters" ...
 $ proportion : num  3.19e-05 1.59e-05 NA 3.98e-06 3.98e-06 ...

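Both proportion and Jane Austen hold numeric values, but they also contain NAs. I tried removing the NAs, but that didn't help, and I would expect the book to have mentioned it if that were a potential problem.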
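
A quick base R check is consistent with the NAs not being the cause: arithmetic on NA returns NA rather than raising an error. This is only a sketch; the two vectors reuse the first few values from the str() output above, and the names `austen`, `other` and `diffs` are made up for illustration.

```r
# First few values of the `Jane Austen` and `proportion` columns,
# copied from the str() output (illustrative only).
austen <- c(9.19e-06, NA, 4.60e-06)
other  <- c(3.19e-05, 1.59e-05, NA)

# Subtraction with NA propagates NA; it does not throw
# "non-numeric argument to binary operator".
diffs <- abs(austen - other)
print(diffs)
```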

These are the libraries I'm using. When I load them, I don't see any conflicts that might be masking functions:

library(dplyr)
library(tidytext)
library(janeaustenr)
library(stringr)
library(tidyr)
library(ggplot2)
library(gutenbergr)
library(scales)

I'm using RStudio version 1.1.442 on Windows 10, with R 3.4.4.

Any ideas about what might be going wrong?

1 Answer:

Answer 0: (score: 3)

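Your problem is easy to overlook. You need backticks, not quotes, around Jane Austen. Here Jane Austen is not a string but the name of a column in frequency, and column names that contain spaces need backticks. With single quotes, R sees the character string 'Jane Austen', and subtracting proportion from a string raises the "non-numeric argument to binary operator" error.

It should be: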

ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) +
.....

instead of:

ggplot(frequency, aes(x=proportion, y='Jane Austen', color = abs('Jane Austen' - proportion))) +
.....
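
The difference between the two forms can be reproduced outside ggplot2. The following is a minimal base R sketch, not the book's code: `toy` is a hypothetical two-row data frame with made-up values standing in for `frequency`, and `with()` stands in for the way `aes()` evaluates expressions against the data.

```r
# Hypothetical stand-in for `frequency`; the values are invented.
toy <- data.frame(proportion = c(0.10, 0.20))
toy[["Jane Austen"]] <- c(0.15, 0.05)

# Quoted: 'Jane Austen' is a character string, so the subtraction fails.
# tryCatch() captures the error message instead of stopping the script.
quoted <- tryCatch(
  with(toy, abs('Jane Austen' - proportion)),
  error = function(e) conditionMessage(e)
)
print(quoted)

# Backticked: `Jane Austen` refers to the column, so the subtraction works.
backticked <- with(toy, abs(`Jane Austen` - proportion))
print(backticked)
```

In the quoted case `quoted` ends up holding the error message text, while `backticked` holds the numeric differences between the two columns.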