Question

I want to get the average number of words in my character vector in R.

one <- c(9, 23, 43)
two <- c("this is a new york times article.", "short article.", "he went outside to smoke a cigarette.")

mydf <- data.frame(one, two)
mydf

#   one                                   two
# 1   9     this is a new york times article.
# 2  23                        short article.
# 3  43 he went outside to smoke a cigarette.

I am looking for a function that gives the average number of words in the character vector "two".

The output here should be 5.3333 (= (7 + 2 + 7) / 3).

Answer 1

Here is one possibility with the qdap package:

library(qdap)
wc(mydf$two, FALSE)/nrow(mydf)

## [1] 5.333333

This is a bit of overkill, but you could also do:

word_stats(mydf$two)

##   all n.sent n.words n.char n.syl n.poly   wps    cps   sps psps   cpw   spw pspw n.state proDF2 n.hapax n.dis grow.rate prop.dis
## 1 all      3      16     68    23      3 5.333 22.667 7.667    1 4.250 1.438 .188       3      1      12     2      .750     .125

The wps column is the number of words per sentence.

Answer 2

Hadley Wickham's stringr package is probably the simplest way:

library(stringr)
foo<- str_split(two, " ") # split each element of your vector by the space sign
sapply(foo,length) # just a quick test: how many words has each element?
sum(sapply(foo,length))/length(foo) # calculate sum and divide it by the length of your original object
# [1] 5.333333

Answer 3

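I'm sure there are more elaborate methods available, but you can use strsplit to split each string on whitespace into a character vector and then count its elements.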

mean(sapply(strsplit(as.character(mydf$two), "[[:space:]]+"), length))
# [1] 5.3333
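
One thing to watch for with this approach: a string with leading whitespace yields an empty first element, which is then counted as a word. Below is a minimal sketch that trims each string first, assuming a "word" is simply a run of non-whitespace characters. The helper name `avg_words` is made up for illustration and is not part of base R.

```r
# Hypothetical helper (not a base R function): average number of
# whitespace-separated words per element of a character vector.
avg_words <- function(x) {
  x <- trimws(as.character(x))            # drop leading/trailing whitespace
  pieces <- strsplit(x, "[[:space:]]+")   # split on runs of whitespace
  mean(lengths(pieces))                   # words per element, then the average
}

two <- c("this is a new york times article.",
         "short article.",
         "he went outside to smoke a cigarette.")

avg_words(two)
# [1] 5.333333

avg_words(c("  padded string ", "one"))
# [1] 1.5
```

Without the `trimws()` call, the padded string would be counted as three words, because `strsplit()` keeps the empty piece before the first match.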

Answer 4

Or with gregexpr():

mean(sapply(mydf$two,function(x)length(unlist(gregexpr(" ",x)))+1))
[1] 5.333333
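
Note that counting spaces and adding one assumes every string contains at least one space and that words are separated by exactly one space. For a single word, gregexpr() returns -1 (still a length-one result), so the formula gives 2. Here is a sketch of a variant that counts the words themselves, again assuming a word is any run of non-whitespace characters. `count_words` is a name invented here for illustration.

```r
# Hypothetical helper: count runs of non-whitespace characters per element.
count_words <- function(x) {
  x <- as.character(x)
  hits <- gregexpr("[^[:space:]]+", x)   # match positions of each word-like run
  lengths(regmatches(x, hits))           # number of matched substrings per element
}

two <- c("this is a new york times article.",
         "short article.",
         "he went outside to smoke a cigarette.")

count_words(two)
# [1] 7 2 7

mean(count_words(two))
# [1] 5.333333

# Edge cases the space-counting version gets wrong:
count_words(c("word", "two  spaces", ""))
# [1] 1 2 0
```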

Average number of words in a character vector in R
