Counting word frequencies across multiple columns in R

Time: 2018-11-15 23:52:18

Tags: r dataframe text nlp

I have a data frame in R with several columns of multi-word text responses. It looks like this:

1a        1b             1c       2a          2b             2c
student   job prospects  money    professors  students       campus
future    career         unsure   my grades   opportunities  university
success   reputation     my job   earnings    courses        unsure

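I'd like to count how often each word occurs across columns 1a, 1b and 1c combined, and across columns 2a, 2b and 2c combined.

Currently I'm using this code to count the word frequencies in each column separately: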

data.frame(table(unlist(strsplit(tolower(dat$`1a`), " "))))

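Ideally, I'd like to merge the two groups of columns into just two columns and then use the same code to count word frequencies, but I'm open to other options.

The combined columns would look like this: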

1              2
student        professors
future         my grades
success        earnings
job prospects  students
career         opportunities
reputation     courses
money          campus
unsure         university
my job         unsure

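3 answers:

Answer 0 (score: 0)

Here is one approach using the dplyr and tidyr packages. As an aside, avoid column names that start with a digit; naming them a1, a2, ... will make things easier in the long run.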

library(dplyr)
library(tidyr)

df %>% 
  gather(variable, value) %>% 
  mutate(variable = substr(variable, 1, 1)) %>% 
  mutate(id = ave(variable, variable, FUN = seq_along)) %>%
  spread(variable, value)

  id             1             2
1  1       student    professors
2  2        future     my grades
3  3       success      earnings
4  4 job prospects      students
5  5        career opportunities
6  6    reputation       courses
7  7         money        campus
8  8        unsure    university
9  9        my job        unsure

Data:

df <- structure(list(`1a` = c("student", "future", "success"), `1b` = c("job prospects", 
"career", "reputation"), `1c` = c("money", "unsure", "my job"
), `2a` = c("professors", "my grades", "earnings"), `2b` = c("students", 
"opportunities", "courses"), `2c` = c("campus", "university", 
"unsure")), .Names = c("1a", "1b", "1c", "2a", "2b", "2c"), class = "data.frame", row.names = c(NA, 
-3L))
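
The reshaped table above still has to be turned into word counts, which is what the question asks for. A minimal base R sketch of that step follows, assuming the same example data; `count_words` is a hypothetical helper name, not part of the original answer. It skips the reshape and pools the cells of each column group directly.

```r
# Same example data as above, rebuilt with data.frame() for brevity.
df <- data.frame(
  `1a` = c("student", "future", "success"),
  `1b` = c("job prospects", "career", "reputation"),
  `1c` = c("money", "unsure", "my job"),
  `2a` = c("professors", "my grades", "earnings"),
  `2b` = c("students", "opportunities", "courses"),
  `2c` = c("campus", "university", "unsure"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: count words across all columns whose name
# starts with `prefix` (e.g. "1" selects 1a, 1b and 1c).
count_words <- function(d, prefix) {
  cols  <- d[startsWith(names(d), prefix)]
  cells <- tolower(unlist(cols, use.names = FALSE))
  words <- unlist(strsplit(cells, " ", fixed = TRUE))
  as.data.frame(table(word = words), responseName = "freq",
                stringsAsFactors = FALSE)
}

freq1 <- count_words(df, "1")  # words pooled from 1a, 1b, 1c
freq2 <- count_words(df, "2")  # words pooled from 2a, 2b, 2c
```

Each result is a data frame with one row per distinct word and its count in a `freq` column. Selecting columns by prefix assumes the first character of the column name identifies the group, as in the example.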

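Answer 1 (score: 0)

In general, you should avoid column names that start with a digit. That aside, I created a reproducible example of your problem and a solution using dplyr and tidyr. The substr() call inside mutate_at() assumes that your column names follow the [num][char] pattern from your example.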

library(dplyr)
library(tidyr)

data <- tibble::tribble(
  ~`1a`, ~`1b`, ~`1c`, ~`2a`, ~`2b`, ~`2c`,
  'student','job prospects', 'money', 'professors', 'students', 'campus',
  'future', 'career', 'unsure', 'my grades', 'opportunities',  'university',
  'success', 'reputation', 'my job', 'earnings', 'courses', 'unsure'
)

data %>%
  gather(key, value) %>%
  mutate_at('key', substr, 1, 1) %>%
  group_by(key) %>%
  mutate(id = row_number()) %>%
  spread(key, value) %>%
  select(-id)

# A tibble: 9 x 2
  `1`           `2`          
  <chr>         <chr>        
1 student       professors   
2 future        my grades    
3 success       earnings     
4 job prospects students     
5 career        opportunities
6 reputation    courses      
7 money         campus       
8 unsure        university   
9 my job        unsure       
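
If the goal is the word counts themselves, the same idea can be carried one step further in a single pipeline. The sketch below is an assumption-laden variant rather than the answer's own code: it uses the values from the question, relies on `pivot_longer()` (which needs tidyr 1.0.0 or later) instead of `gather()`/`spread()`, and introduces the column names `group` and `word` purely for illustration.

```r
library(dplyr)
library(tidyr)

data <- tibble::tribble(
  ~`1a`, ~`1b`, ~`1c`, ~`2a`, ~`2b`, ~`2c`,
  "student", "job prospects", "money",  "professors", "students",      "campus",
  "future",  "career",        "unsure", "my grades",  "opportunities", "university",
  "success", "reputation",    "my job", "earnings",   "courses",       "unsure"
)

word_counts <- data %>%
  # one row per cell, with the source column name in `key`
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  # the leading digit of the column name identifies the group
  mutate(group = substr(key, 1, 1), value = tolower(value)) %>%
  # split multi-word responses into one row per word
  separate_rows(value, sep = " ") %>%
  # count each word within each group, most frequent first
  count(group, word = value, sort = TRUE)
```

`word_counts` has one row per group and word, with the count in column `n`. As with the answer's `substr()` call, this assumes the [num][char] naming pattern.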

Answer 2 (score: 0)

If your ultimate goal is to count frequencies (rather than to reshape from wide to long format), you can use

ave(unlist(df[,paste0("a",1:3)]), unlist(df[,paste0("a",1:3)]), FUN = length)

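This computes, for each element of columns a1, a2 and a3, how many times that value occurs, where df is the data frame (with columns labelled a1, a2, a3, b1, b2, b3). Note that it counts whole cell values such as "job prospects", not individual words.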
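
To make the difference between cell-level and word-level counts concrete, here is a small sketch. It assumes the question's first column group has been renamed to a1, a2, a3 as this answer suggests; the variable names `cells`, `cell_counts` and `word_freq` are illustrative only.

```r
# First column group from the question, renamed a1..a3 (assumption).
df <- data.frame(
  a1 = c("student", "future", "success"),
  a2 = c("job prospects", "career", "reputation"),
  a3 = c("money", "unsure", "my job"),
  stringsAsFactors = FALSE
)

cells <- unlist(df[, paste0("a", 1:3)], use.names = FALSE)

# ave() returns one count per cell, aligned with the input vector.
# Every cell value is distinct here, so each count is 1.
cell_counts <- ave(seq_along(cells), cells, FUN = length)

# Word-level counts: split each cell on spaces first, then tabulate.
words     <- unlist(strsplit(tolower(cells), " ", fixed = TRUE))
word_freq <- sort(table(words), decreasing = TRUE)
```

With this data, "job" appears in two different cells ("job prospects" and "my job"), so only the word-level table reports it twice.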