Count words in a string, grouped by year

Time: 2018-03-25 21:27:27

Tags: r dataframe top-n

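I am trying to find popular words in strings using R, which is probably easiest to explain with an example.

Take this as input (the real data has millions of entries, and each date can occur thousands of times):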

        IncorporationDate                          CompanyName
3007931        2003-05-12 OUTLANE BUSINESS CONSULTANTS LIMITED
692999         2013-03-28          AGB SERVICES ANGLIA LIMITED
2255234        2008-05-22           CIDA INTERNATIONAL LIMITED
310577         2017-09-19               FA IT SERVICES LIMITED
2020738        2012-09-03              THE SPARES SHOP LIMITED
2776144        2006-02-03         ANGELVIEW PROPERTIES LIMITED
2420435        2017-10-17                SHANE WARD TM LIMITED
2523165        2014-06-04      THE INDEPENDENT GIN COMPANY LTD
2594847        2015-05-05                  AIA ENGINEERING LTD
2701395        2015-05-27                LAURA BRIDGES LIMITED

I want to find the ten most popular words used in each year, with a result that looks like this:

| Year | Top1    | Top1_Count | Top2 | Top2_Count | ...
| ---- | ------- | ---------- | ---- | ---------- | 
| 2017 | LIMITED | 2          | IT   | 1          |
| ...

The closest I have got so far is:

words <- data.frame(table(unlist(strsplit(tolower(df$SText), " "))))

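But that loses the year information and only gives overall totals across the whole data frame.

I have also played with dplyr's summarise, but have not found a way to make it do what I want.

Edit: using the answer from @maurits-evers I have got further, and found the top ten per year with this: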

top_words_by_year <- words_by_year %>% group_by(year) %>% top_n(n = 10, wt = n)

I am now trying to work out how to get that into the shape I need.

Thanks

1 answer:

Answer 0: (score: 1)

You can do something like this:

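library(tidyverse)
df %>%
    # Extract the four-digit year from the incorporation date
    mutate(year = format(as.Date(IncorporationDate, format = "%Y-%m-%d"), "%Y")) %>%
    group_by(year) %>%
    # Split each company name on spaces into a list-column of words
    mutate(words = strsplit(as.character(CompanyName), " ")) %>%
    # Expand the list-column to one row per word (recent tidyr expects the column to be named)
    unnest(words) %>%
    # Count how often each word occurs within each year
    count(year, words)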
#    year  words             n
#    <chr> <chr>         <int>
#  1 2003  BUSINESS          1
#  2 2003  CONSULTANTS       1
#  3 2003  LIMITED           1
#  4 2003  OUTLANE           1
#  5 2006  ANGELVIEW         1
#  6 2006  LIMITED           1
#  7 2006  PROPERTIES        1
#  8 2008  CIDA              1
#  9 2008  INTERNATIONAL     1
# 10 2008  LIMITED           1
# # ... with 26 more rows

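Explanation: extract the year from IncorporationDate, group by year, split CompanyName into words, unnest the resulting list-column so that each word gets its own row, and count the occurrences of words per year.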
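The long year/word counts can then be cut down to the top words per year and spread into one row per year, which is the shape the question asks for. The following is only a sketch under some assumptions: it uses a small stand-in data frame rather than the full data, a hypothetical cutoff `top_n_words` of 2 instead of 10, breaks ties alphabetically (the question does not say how ties should be handled), and relies on `slice_head()` and `pivot_wider()` with `names_glue`, which need reasonably recent dplyr and tidyr versions. The resulting column names (`Top1_words`, `Top1_count`, ...) are illustrative, not the exact names from the question.

```r
library(dplyr)
library(tidyr)

# Stand-in rows with the same columns as the question's data (assumption)
df <- data.frame(
  IncorporationDate = c("2017-09-19", "2017-10-17", "2015-05-05", "2015-05-27"),
  CompanyName = c("FA IT SERVICES LIMITED", "SHANE WARD TM LIMITED",
                  "AIA ENGINEERING LTD", "LAURA BRIDGES LIMITED"),
  stringsAsFactors = FALSE
)

top_n_words <- 2  # hypothetical cutoff; the question wants 10

wide <- df %>%
  mutate(year  = format(as.Date(IncorporationDate), "%Y"),
         words = strsplit(CompanyName, " ")) %>%
  unnest(words) %>%                                    # one row per word
  count(year, words, name = "count") %>%               # occurrences per year/word
  group_by(year) %>%
  arrange(desc(count), words, .by_group = TRUE) %>%    # ties broken alphabetically
  slice_head(n = top_n_words) %>%                      # keep the first rows per year
  mutate(rank = row_number()) %>%                      # 1 = most frequent
  ungroup() %>%
  pivot_wider(id_cols = year,
              names_from = rank,
              values_from = c(words, count),
              names_glue = "Top{rank}_{.value}")

print(wide)
```

In the stand-in rows, LIMITED occurs twice in 2017 and so ranks first for that year. Unlike `top_n()`, which keeps all tied rows, `slice_head()` after sorting returns exactly `top_n_words` rows per year, so the wide table has a fixed number of columns.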

Sample data

df <- read.table(text =
    "IncorporationDate                          CompanyName
3007931        2003-05-12 'OUTLANE BUSINESS CONSULTANTS LIMITED'
692999         2013-03-28          'AGB SERVICES ANGLIA LIMITED'
2255234        2008-05-22           'CIDA INTERNATIONAL LIMITED'
310577         2017-09-19               'FA IT SERVICES LIMITED'
2020738        2012-09-03              'THE SPARES SHOP LIMITED'
2776144        2006-02-03         'ANGELVIEW PROPERTIES LIMITED'
2420435        2017-10-17                'SHANE WARD TM LIMITED'
2523165        2014-06-04      'THE INDEPENDENT GIN COMPANY LTD'
2594847        2015-05-05                  'AIA ENGINEERING LTD'
2701395        2015-05-27                'LAURA BRIDGES LIMITED'", header = T)
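For comparison, the base R attempt from the question can be extended so that the year is kept next to every word. This is a minimal sketch, not a tuned solution: it uses a few stand-in rows with the same columns as the sample data, and the names `long` and `counts` are made up for illustration. Note that `table()` builds a dense year-by-word cross-tabulation, which may become large with millions of distinct words.

```r
# Stand-in rows with the same columns as the sample data (assumption)
df <- data.frame(
  IncorporationDate = c("2017-09-19", "2017-10-17", "2015-05-05"),
  CompanyName = c("FA IT SERVICES LIMITED", "SHANE WARD TM LIMITED",
                  "AIA ENGINEERING LTD"),
  stringsAsFactors = FALSE
)

year  <- format(as.Date(df$IncorporationDate), "%Y")
words <- strsplit(tolower(df$CompanyName), " ")

# Repeat each row's year once per word so both vectors line up
long <- data.frame(year = rep(year, lengths(words)),
                   word = unlist(words),
                   stringsAsFactors = FALSE)

# Cross-tabulate year by word, then drop combinations that never occur
counts <- as.data.frame(table(year = long$year, word = long$word),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[counts$n > 0, ]

# Most frequent words first within each year
counts <- counts[order(counts$year, -counts$n, counts$word), ]
print(counts)
```

The key step is `rep(year, lengths(words))`: `unlist()` on its own flattens the words and discards which row each one came from, which is why the original one-liner lost the year.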