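I have a data frame with a column named msgText that contains text. I would like to create another column named "wordcount" that counts how many words each row of "msgText" contains.
The column looks like this: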
head(all_transcripts$msgText)
[1] "Hi, my name is Chris and I am a programmer"
[2] "I have worked with R for 12 years"
[3] "Being a programmer I have many questions"
[4] "The fellow programmers at Stackoverflow help me to get the answer"
[5] "This help has saved my life many times."
[6] "Thanks Stackoverflow!"
The result I want is:
head(all_transcripts$wordcount)
[1] 10
[2] 8
[3] 7
[4] 11
[5] 8
[6] 2
To do this, I am using the ngram library with its wordcount function.
I tried:
all_transcripts$wordcount <- apply(all_transcripts, 2,
wordcount(all_transcripts$msgText))
However, doing so gives the following error:
Error in match.fun(FUN) : 'wordcount(all_transcripts$msgText)' is
not a function, character or symbol
How can I use the apply function correctly, without having to use a for loop over the dataset?
Answer 0 (score: 2)
We can loop over the elements of 'msgText' and apply the wordcount function to each one:
library(ngram)
library(tidyverse)
all_transcripts %>%
mutate(wordcount = map_int(msgText, wordcount))
# msgText wordcount
#1 Hi, my name is Chris and I am a programmer 10
#2 I have worked with R for 12 years 8
#3 Being a programmer I have many questions 7
#4 The fellow programmers at Stackoverflow help me to get the answer 11
#5 This help has saved my life many times. 8
#6 Thanks Stackoverflow! 2
Or with base R:
all_transcripts$wordcount <- sapply(all_transcripts$msgText, wordcount)
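As a variation on the base R line above, here is a minimal sketch using vapply, which checks the type of each result and can drop the names that sapply attaches to character input. The helper count_words is a hypothetical stand-in for ngram::wordcount (so the sketch runs without the package) and simply counts whitespace-separated tokens; it is not part of the original answer.

```r
# Hypothetical stand-in for ngram::wordcount: counts whitespace-separated
# tokens in a single string. Assumed for illustration only.
count_words <- function(x) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  sum(nzchar(tokens))
}

all_transcripts <- data.frame(
  msgText = c("I have worked with R for 12 years", "Thanks Stackoverflow!"),
  stringsAsFactors = FALSE
)

# vapply checks that every result is a single integer; USE.NAMES = FALSE
# returns an unnamed vector instead of one named by the input strings.
all_transcripts$wordcount <- vapply(
  all_transcripts$msgText, count_words,
  FUN.VALUE = integer(1), USE.NAMES = FALSE
)
```

With the real package loaded, wordcount could presumably be passed in place of count_words, provided it returns a single integer per string.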
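There are two problems in the OP's code. First, apply expects a function as its third argument, but wordcount(all_transcripts$msgText) is a call: R evaluates it and passes on the result, which is not a function, hence the error from match.fun. Second, MARGIN = 2 in apply loops over the columns of the data frame, whereas the count is needed for each element of the vector all_transcripts$msgText, which has no dim attribute.
Data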
all_transcripts <- structure(list(msgText = c("Hi, my name is Chris and I am a programmer",
"I have worked with R for 12 years", "Being a programmer I have many questions",
"The fellow programmers at Stackoverflow help me to get the answer",
"This help has saved my life many times.", "Thanks Stackoverflow!"
)), class = "data.frame", row.names = c(NA, -6L))
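The difference between passing a function and passing a call can be sketched as follows. The helper count_words is again a hypothetical base R stand-in for ngram::wordcount, and msgs is an illustrative vector; neither name comes from the original post.

```r
# Hypothetical vectorized stand-in for ngram::wordcount (assumption).
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))

msgs <- c("Being a programmer I have many questions", "Thanks Stackoverflow!")

# Passing a call: count_words(msgs) is evaluated first, and its result
# (an integer vector) is not a function, so match.fun raises an error.
err <- tryCatch(
  sapply(msgs, count_words(msgs)),
  error = function(e) conditionMessage(e)
)

# Passing the function itself: it is called once per element of msgs.
counts <- sapply(msgs, count_words, USE.NAMES = FALSE)
```

Here err holds the error message as a string, while counts holds one word count per message.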
Answer 1 (score: 0)
Consider using lengths together with the vectorized strsplit to count words in base R:
all_transcripts$word_count <- lengths(strsplit(all_transcripts$text, split=" "))
all_transcripts
# text word_count
# 1 Hi, my name is Chris and I am a programmer 10
# 2 I have worked with R for 12 years 8
# 3 Being a programmer I have many questions 7
# 4 The fellow programmers at Stackoverflow help me to get the answer 11
# 5 This help has saved my life many times. 8
# 6 Thanks Stackoverflow! 2
Data
all_transcripts <- data.frame(text=c("Hi, my name is Chris and I am a programmer",
"I have worked with R for 12 years",
"Being a programmer I have many questions",
"The fellow programmers at Stackoverflow help me to get the answer",
"This help has saved my life many times.",
"Thanks Stackoverflow!"),
stringsAsFactors=FALSE)
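One caveat with split=" " is that repeated or leading spaces produce empty tokens, which inflate the count. A minimal sketch of a more tolerant variant, assuming words are separated by any run of whitespace (the vector text below is made up for illustration):

```r
# Illustrative input with irregular spacing (not from the original data).
text <- c("Thanks   Stackoverflow!", "  leading and trailing  ")

# Splitting on a single space counts the empty strings between spaces.
naive <- lengths(strsplit(text, split = " "))

# Trimming first and splitting on runs of whitespace avoids empty tokens.
robust <- lengths(strsplit(trimws(text), split = "\\s+"))
```

On clean single-spaced text such as the sample data, both versions should agree.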