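I'm trying to use R to find the most popular words in strings, which is probably easiest to explain with an example.
Take this as input (the real data has millions of entries, and each date can occur thousands of times):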
        IncorporationDate CompanyName
3007931 2003-05-12 OUTLANE BUSINESS CONSULTANTS LIMITED
692999  2013-03-28 AGB SERVICES ANGLIA LIMITED
2255234 2008-05-22 CIDA INTERNATIONAL LIMITED
310577  2017-09-19 FA IT SERVICES LIMITED
2020738 2012-09-03 THE SPARES SHOP LIMITED
2776144 2006-02-03 ANGELVIEW PROPERTIES LIMITED
2420435 2017-10-17 SHANE WARD TM LIMITED
2523165 2014-06-04 THE INDEPENDENT GIN COMPANY LTD
2594847 2015-05-05 AIA ENGINEERING LTD
2701395 2015-05-27 LAURA BRIDGES LIMITED
I want to find the ten most popular words used in each year, with a result that looks something like this:
| Year | Top1    | Top1_Count | Top2 | Top2_Count | ... |
| ---- | ------- | ---------- | ---- | ---------- | --- |
| 2017 | LIMITED | 2          | IT   | 1          | ... |
| ...  |         |            |      |            |     |
The closest I have come so far is:
words <- data.frame(table(unlist(strsplit(tolower(df$SText), " "))))
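But that loses the year information and only gives overall totals across the whole data frame.
I have also played with dplyr's summarise, but haven't yet found a way to make it do what I want.
Edit: using the answer from @maurits-evers I have gotten further, and found the top ten with this: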
top_words_by_year <- words_by_year %>% group_by(year) %>% top_n(n = 10, wt = n)
Now I'm trying to work out how to turn that into the shape I need.
Thanks
Answer 0: (score: 1)
You can do something like this:
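library(tidyverse)
df %>%
  # Extract the four-digit year from the incorporation date
  mutate(year = format(as.Date(IncorporationDate, format = "%Y-%m-%d"), "%Y")) %>%
  group_by(year) %>%
  # Split each company name on spaces into a list-column of words
  mutate(words = strsplit(as.character(CompanyName), " ")) %>%
  # Expand the list-column so that each word gets its own row
  unnest(words) %>%
  # Count how often each word occurs within each year
  count(year, words)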
# year words n
#<chr> <chr> <int>
#1 2003 BUSINESS 1
#2 2003 CONSULTANTS 1
#3 2003 LIMITED 1
#4 2003 OUTLANE 1
#5 2006 ANGELVIEW 1
#6 2006 LIMITED 1
#7 2006 PROPERTIES 1
#8 2008 CIDA 1
#9 2008 INTERNATIONAL 1
#10 2008 LIMITED 1
## ... with 26 more rows
Explanation: extract the year from IncorporationDate, group by year, split CompanyName into words, unnest so that each word gets its own row, and count each of the words per year.
Sample data
df <- read.table(text =
"IncorporationDate CompanyName
3007931 2003-05-12 'OUTLANE BUSINESS CONSULTANTS LIMITED'
692999 2013-03-28 'AGB SERVICES ANGLIA LIMITED'
2255234 2008-05-22 'CIDA INTERNATIONAL LIMITED'
310577 2017-09-19 'FA IT SERVICES LIMITED'
2020738 2012-09-03 'THE SPARES SHOP LIMITED'
2776144 2006-02-03 'ANGELVIEW PROPERTIES LIMITED'
2420435 2017-10-17 'SHANE WARD TM LIMITED'
2523165 2014-06-04 'THE INDEPENDENT GIN COMPANY LTD'
2594847 2015-05-05 'AIA ENGINEERING LTD'
2701395 2015-05-27 'LAURA BRIDGES LIMITED'", header = TRUE)
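The counts per year and word are still in long format, whereas the question asks for one row per year with Top1, Top1_Count, and so on. A minimal sketch of one way to get there is below. It is not part of the original answer. It assumes dplyr and tidyr (1.0.0 or later, for pivot_wider), uses a cut-down copy of the sample data, and introduces the names top_n_words, rank and top_words_wide purely for illustration. Ties are broken alphabetically, which is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

# Cut-down sample in the same shape as the question's data (illustrative only)
df <- data.frame(
  IncorporationDate = c("2017-09-19", "2017-10-17", "2015-05-05", "2015-05-27"),
  CompanyName = c("FA IT SERVICES LIMITED", "SHANE WARD TM LIMITED",
                  "AIA ENGINEERING LTD", "LAURA BRIDGES LIMITED"),
  stringsAsFactors = FALSE
)

# Hypothetical parameter: the question asks for 10; 3 keeps the example small
top_n_words <- 3

# Long format: one row per (year, word) with its count n
words_by_year <- df %>%
  mutate(year = format(as.Date(IncorporationDate), "%Y"),
         words = strsplit(CompanyName, " ", fixed = TRUE)) %>%
  unnest(words) %>%
  count(year, words, name = "n")

# Rank words within each year, keep the top ones, then spread to wide format
top_words_wide <- words_by_year %>%
  group_by(year) %>%
  arrange(desc(n), words, .by_group = TRUE) %>%  # ties broken alphabetically
  mutate(rank = row_number()) %>%
  filter(rank <= top_n_words) %>%
  ungroup() %>%
  pivot_wider(id_cols = year,
              names_from = rank,
              values_from = c(words, n),
              names_glue = "Top{rank}_{.value}")

print(top_words_wide)
```

This gives columns named Top1_words, Top2_words, ... followed by Top1_n, Top2_n, ..., so the word and count columns are grouped rather than interleaved. If the interleaved order from the question matters, the columns can be reordered afterwards with select(). Newer tidyr releases also appear to offer a names_vary argument to pivot_wider for this, but check the documentation for your installed version before relying on it.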