Row-by-row text mining in a data frame

Asked: 2017-02-20 11:32:43

Tags: r text word-count mining

I have this data frame:

> str(final)
'data.frame':   112 obs. of  3 variables:
 $ FAO_CountryName: chr  Algeria  Egypt  Libya  Morocco ...
 $ FAO_CountryURL : chr  "http://www.fao.org/giews/countrybrief/country.jsp?code=DZA" "http://www.fao.org/giews/countrybrief/country.jsp?code=EGY" "http://www.fao.org/giews/countrybrief/country.jsp?code=LBY" "http://www.fao.org/giews/countrybrief/country.jsp?code=MAR" ...
 $ Text           : chr  "\r\n   Reference Date: 24-November-2016\r\n   \r\n   \r\n               FOOD SECURITY SNAPSHOT\r\n               \r\n          "| __truncated__ "\r\n   Reference Date: 28-November-2016\r\n   \r\n   \r\n               FOOD SECURITY SNAPSHOT\r\n               \r\n          "| __truncated__ "\r\n   Reference Date: 15-November-2016\r\n   \r\n   \r\n               FOOD SECURITY SNAPSHOT\r\n               \r\n          "| __truncated__ "\r\n   Reference Date: 21-September-2016\r\n   \r\n   \r\n               FOOD SECURITY SNAPSHOT\r\n               \r\n         "| __truncated__ ...

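I would like to process the Text variable so that, for example, I can count how many times each word occurs within each row. In other words, I would like to end up with a data frame like this: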

> head(final, n=2)
  FAO_CountryName   FAO_CountryURL        Text                 WordCount
  Algeria           http://www.fao.org…   Algeria is nice…     Algeria  1
                                                               is       1
                                                               ...
  Egypt             http://www.fao.org…   Egypt is nice too…   Egypt    1
                                                               is       5
                                                               ...

This is what I have tried so far:

library(dplyr)
library(tidytext)

## Counting the words included in the textual dataset.
keywords <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  ungroup()

## Scoring the textual frequencies into the textual dataset (i.e. how many times the words are present)
total_words <- keywords %>%
  group_by(word) %>%
  summarize(total = sum(n))
However, this only gives me word counts across the whole column, not row by row. Any suggestions?
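For reference, one possible direction is to keep a row identifier through the tokenizing step and include it in `count()`, so the counts are computed within each row instead of over the whole column. The following is only a minimal sketch under assumptions: the two-row `final` data frame is a made-up stand-in for the real data, and `row_id` and `row_counts` are hypothetical names introduced for illustration. Note that `unnest_tokens()` lowercases tokens by default.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for `final` (hypothetical data; the real Text column holds the FAO briefs)
final <- data.frame(
  FAO_CountryName = c("Algeria", "Egypt"),
  Text = c("Algeria is nice", "Egypt is nice too, Egypt is big"),
  stringsAsFactors = FALSE
)

row_counts <- final %>%
  mutate(row_id = row_number()) %>%          # explicit row identifier (assumed name)
  unnest_tokens(word, Text) %>%              # one row per word, lowercased by default
  count(row_id, FAO_CountryName, word, sort = TRUE)  # counts within each original row

print(row_counts)
```

The result is in long format, with one line per (row, word) pair and the count in column `n`. If the nested layout from the desired output is needed, these per-row counts could probably be collapsed back into one entry per country afterwards.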

0 answers:

No answers yet