I have a data frame in R with several columns of multi-word text responses. It looks like this:
1a       1b             1c      2a          2b             2c
student  job prospects  money   professors  students       campus
future   career         unsure  my grades   opportunities  university
success  reputation     my job  earnings    courses        unsure
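I would like to count how often each word occurs across columns 1a, 1b and 1c combined, and across columns 2a, 2b and 2c combined.
At the moment I am using this code to count the word frequencies in each column separately: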
data.frame(table(unlist(strsplit(tolower(dat$`1a`), " "))))
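Ideally, I would like to merge the two groups of columns into just two columns and then reuse the same code to count word frequencies, but I am open to other options.
The merged columns would look like this: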
1              2
student        professors
future         my grades
success        earnings
job prospects  students
career         opportunities
reputation     courses
money          campus
unsure         university
my job         unsure
Answer 0 (score: 0)
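Here is one approach using the dplyr and tidyr packages. As a side note, avoid column names that start with a digit; naming them a1, a2, ... will make things easier in the long run.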
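library(dplyr)
library(tidyr)

df %>%
  gather(variable, value) %>%                                  # wide to long: one row per response
  mutate(variable = substr(variable, 1, 1)) %>%                # keep only the group digit ("1" or "2")
  mutate(id = ave(variable, variable, FUN = seq_along)) %>%    # running index within each group
  spread(variable, value)                                      # back to wide: one column per group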
id 1 2
1 1 student professors
2 2 future my grades
3 3 success earnings
4 4 job prospects students
5 5 career opportunities
6 6 reputation courses
7 7 money campus
8 8 unsure university
9 9 my job unsure
Data:
df <- structure(list(`1a` = c("student", "future", "success"), `1b` = c("job prospects",
"career", "reputation"), `1c` = c("money", "unsure", "my job"
), `2a` = c("professors", "my grades", "earnings"), `2b` = c("students",
"opportunities", "courses"), `2c` = c("campus", "university",
"unsure")), .Names = c("1a", "1b", "1c", "2a", "2b", "2c"), class = "data.frame", row.names = c(NA,
-3L))
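If the word counts are all that is needed, the reshaping packages are not strictly required. The following is a minimal base R sketch, assuming the column names follow the digit-plus-letter pattern of the example so that the first character identifies the group. The names `groups`, `combined` and `word_freq` are illustrative and do not come from the original answer.

```r
# Example data, rebuilt here so the sketch is self-contained
df <- data.frame(
  `1a` = c("student", "future", "success"),
  `1b` = c("job prospects", "career", "reputation"),
  `1c` = c("money", "unsure", "my job"),
  `2a` = c("professors", "my grades", "earnings"),
  `2b` = c("students", "opportunities", "courses"),
  `2c` = c("campus", "university", "unsure"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Assumption: the first character of each column name is the group label
groups <- split(names(df), substr(names(df), 1, 1))

# Stack the columns of each group into one character vector
combined <- lapply(groups, function(cols) unlist(df[cols], use.names = FALSE))

# Apply the word-count idea from the question to each combined vector
word_freq <- lapply(combined, function(x) {
  as.data.frame(table(word = unlist(strsplit(tolower(x), " "))),
                stringsAsFactors = FALSE)
})

word_freq[["1"]]  # "job" is counted twice in group 1
```

The result is a list with one frequency table per group, which avoids creating columns named `1` and `2` at all.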
Answer 1 (score: 0)
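In general, column names that start with a digit should be avoided. That aside, I have created a reproducible example of your problem and provided a solution using dplyr and tidyr. The substr() call inside mutate_at assumes that your column names follow the [num][char] pattern shown in the example.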
library(dplyr)
library(tidyr)
data <- tibble::tribble(
~`1a`, ~`1b`, ~`1c`, ~`2a`, ~`2b`, ~`2c`,
  'student', 'job prospects', 'money',  'professors', 'students',      'campus',
  'future',  'career',        'unsure', 'my grades',  'opportunities', 'university',
  'success', 'reputation',    'my job', 'earnings',   'courses',       'unsure'
)
data %>%
  gather(key, value) %>%              # wide to long: one row per response
  mutate_at('key', substr, 1, 1) %>%  # keep only the group digit
  group_by(key) %>%
  mutate(id = row_number()) %>%       # running index within each group
  spread(key, value) %>%              # one column per group
  select(-id)
# A tibble: 9 x 2
  `1`           `2`
  <chr>         <chr>
1 student       professors
2 future        my grades
3 success       earnings
4 job prospects students
5 career        opportunities
6 reputation    courses
7 money         campus
8 unsure        university
9 my job        unsure
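The reshaped table still has to be split into words before anything can be counted. The following sketch goes straight from the wide data to per-group word counts, using tidyr's `pivot_longer()` and `separate_rows()` together with dplyr's `count()`. It is not part of the original answer, the column names `group`, `response` and `word` are illustrative, and it again assumes the first character of each column name identifies the group.

```r
library(dplyr)
library(tidyr)

data <- tibble::tribble(
  ~`1a`,     ~`1b`,           ~`1c`,    ~`2a`,        ~`2b`,           ~`2c`,
  "student", "job prospects", "money",  "professors", "students",      "campus",
  "future",  "career",        "unsure", "my grades",  "opportunities", "university",
  "success", "reputation",    "my job", "earnings",   "courses",       "unsure"
)

word_counts <- data %>%
  # Long format: one row per (original column, response)
  pivot_longer(everything(), names_to = "group", values_to = "response") %>%
  # Assumption: the first character of the column name is the group label
  mutate(group = substr(group, 1, 1)) %>%
  # One row per word of each multi-word response
  separate_rows(response, sep = " ") %>%
  mutate(word = tolower(response)) %>%
  # Frequency of each word within each group, most frequent first
  count(group, word, sort = TRUE)

word_counts
```

Keeping the result in long format (one row per group and word) makes it easy to filter or compare the two groups afterwards.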
Answer 2 (score: 0)
If your ultimate goal is to compute frequencies (rather than to switch from wide to long format), you can use
ave(unlist(df[,paste0("a",1:3)]), unlist(df[,paste0("a",1:3)]), FUN = length)
This computes the frequency of the elements in columns a1, a2, a3, where df is the data frame (with columns labelled a1, a2, a3, b1, b2, b3).
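Note that `ave()` returns one value per input element, so it reports how often each whole response occurs rather than each word. The sketch below contrasts that with a `table()` over the individual words. It assumes the first group of columns has been renamed to a1, a2, a3 as this answer suggests, and the variable names are illustrative.

```r
# Example data for the first group, with the renamed columns assumed by the answer
df <- data.frame(
  a1 = c("student", "future", "success"),
  a2 = c("job prospects", "career", "reputation"),
  a3 = c("money", "unsure", "my job"),
  stringsAsFactors = FALSE
)

# Stack the three columns into one character vector of responses
responses <- unlist(df[, paste0("a", 1:3)], use.names = FALSE)

# ave() gives one count per element: how many times that whole response occurs
per_response <- ave(seq_along(responses), responses, FUN = length)

# Splitting on spaces first and using table() gives one count per distinct word
words <- unlist(strsplit(tolower(responses), " "))
word_counts <- table(words)

word_counts[["job"]]  # "job prospects" and "my job" both contribute
```

Passing `seq_along(responses)` as the first argument keeps the result of `ave()` numeric; with the character vector itself as the first argument the counts are stored back as strings.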