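I'm reading a novel and want to find where character names occur throughout the book. Some characters go by more than one name. For example, the character "Sissy Jupe" shows up as both "Sissy" and "Jupe". I want to combine the two rows of word counts into one so that I can see a single count for "Sissy Jupe".
I've looked into sum, rbind, merge and other approaches suggested on message boards, but nothing seems to work. There are lots of great examples out there, but none of them did the job for me.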
library(tidyverse)
library(gutenbergr)
library(tidytext)

ht <- gutenberg_download(786)

# Number the lines and derive a running chapter index from the chapter headings
ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

# One word per row
tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+"))  # keeps only letters and apostrophes; removes _

# Count words per chapter, filling missing chapter/word pairs with 0
ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  complete(chapter, word,
           fill = list(n = 0))

gradgrind <- filter(ht_count, word == "gradgrind")
bounderby <- filter(ht_count, word == "bounderby")
sissy <- filter(ht_count, word == "sissy")

## TEST
sissy_jupe <- ht_count %>%
  filter(word %in% c("sissy", "jupe"))
What I want is a single "word" entry named "sissy_jupe", with n counted by chapter. The result below is close, but it's not quite that.
# A tibble: 76 x 3
   chapter word      n
     <int> <chr> <dbl>
 1       0 jupe      0
 2       0 sissy     1
 3       1 jupe      0
 4       1 sissy     0
 5       2 jupe      5
 6       2 sissy     9
 7       3 jupe      3
 8       3 sissy     1
 9       4 jupe      1
10       4 sissy     0
# … with 66 more rows
Answer 0 (score: 1)
The code below should give you the output you're after.
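library(tidyverse)

# df is assumed to be the filtered tibble from the question (e.g. sissy_jupe)
df %>%
  group_by(chapter) %>%
  mutate(n = sum(n),
         word = paste(word, collapse = "_")) %>%  # joins the names in row order
  distinct(chapter, .keep_all = TRUE)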
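If you want one row per chapter labelled exactly "sissy_jupe", one possible variant is to collapse with summarise() and set the label explicitly. This is only a minimal sketch on toy data copied from the first rows of the question's output; the `combined` name is just illustrative, and `.groups` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data: chapters 0-4 from the output shown in the question
sissy_jupe <- tibble(
  chapter = rep(0:4, each = 2),
  word    = rep(c("jupe", "sissy"), times = 5),
  n       = c(0, 1, 0, 0, 5, 9, 3, 1, 1, 0)
)

# One row per chapter: fixed label, counts of both names added together
combined <- sissy_jupe %>%
  group_by(chapter) %>%
  summarise(word = "sissy_jupe", n = sum(n), .groups = "drop")

print(combined)
```

On the full data the same pipeline should apply to the `sissy_jupe` tibble built in the question, since it only relies on the chapter, word and n columns.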
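Answer 1 (score: 0)
Welcome to Stack Overflow, Tom. Here's an idea:
Basically, (1) find "sissy" or "jupe" in the tidied tibble and replace it with "sissy_jupe", (2) create ht_count the same way you did, and (3) print the result: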
library(tidyverse)
library(gutenbergr)
library(tidytext)

ht <- gutenberg_download(786)

ht_chap <- ht %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

tidy_ht <- ht_chap %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+"))  # keeps only letters and apostrophes; removes _

# NEW CODE START
# Recode either name to a single label before counting
tidy_ht <- tidy_ht %>%
  mutate(word = str_replace_all(word, "sissy|jupe", replacement = "sissy_jupe"))
# END NEW CODE

ht_count <- tidy_ht %>%
  group_by(chapter) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  complete(chapter, word,
           fill = list(n = 0))

# NEW CODE
sissy_jupe <- ht_count %>%
  filter(str_detect(word, "sissy_jupe"))
# END
...which produces...
# A tibble: 38 x 3
   chapter word           n
     <int> <chr>      <dbl>
 1       0 sissy_jupe     1
 2       1 sissy_jupe     0
 3       2 sissy_jupe    14
 4       3 sissy_jupe     4
 5       4 sissy_jupe     1
 6       5 sissy_jupe     5
 7       6 sissy_jupe    20
 8       7 sissy_jupe     7
 9       8 sissy_jupe     2
10       9 sissy_jupe    38
# ... with 28 more rows
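One thing to watch with the replacement step: an unanchored pattern also rewrites longer tokens that merely contain one of the names (a possessive such as "sissy's", if the tokenizer keeps it as one token). Here is a hedged sketch of an anchored variant on a made-up word vector; the `words` and `recoded` names are illustrative and not part of the code above.

```r
library(stringr)

# Made-up tokens for illustration only
words <- c("sissy", "jupe", "sissy's", "gradgrind")

# ^ and $ restrict the match to tokens that are exactly "sissy" or "jupe"
recoded <- str_replace(words, "^(sissy|jupe)$", "sissy_jupe")

print(recoded)
```

Whether you want possessives folded into the same count is a judgment call; if you do, the unanchored pattern may be the better fit, at the cost of extra labels such as "sissy_jupe's".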
If our solutions helped you (feedback = better coders), please don't forget to upvote or click the check mark.