Average number of words in a character vector in R

Time: 2014-03-12 10:25:41

Tags: r dataframe character word-count

I want to get the average number of words in my character vector in R:

one <- c(9, 23, 43)
two <- c("this is a new york times article.", "short article.", "he went outside to smoke a cigarette.")

mydf <- data.frame(one, two)
mydf

#   one                                   two
# 1   9     this is a new york times article.
# 2  23                        short article.
# 3  43 he went outside to smoke a cigarette.

I am looking for a function that gives the average number of words in the character vector "two".

The output here should be 5.3333 (= (7 + 2 + 7) / 3).

4 answers:

Answer 0 (score: 3)

Here is one possibility using the qdap package:

library(qdap)
wc(mydf$two, FALSE)/nrow(mydf)

## [1] 5.333333

This is a bit of overkill, but you could also do:

word_stats(mydf$two)

##   all n.sent n.words n.char n.syl n.poly   wps    cps   sps psps   cpw   spw pspw n.state proDF2 n.hapax n.dis grow.rate prop.dis
## 1 all      3      16     68    23      3 5.333 22.667 7.667    1 4.250 1.438 .188       3      1      12     2      .750     .125

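The wps column is words per sentence. Since each element here is a single sentence (n.sent is 3), it matches the average number of words per element.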

Answer 1 (score: 2)

Hadley Wickham's stringr package is probably the simplest way:

library(stringr)
foo <- str_split(two, " ")  # split each element of the vector on single spaces
sapply(foo, length)  # quick check: how many words does each element have?
sum(sapply(foo, length)) / length(foo)  # total word count divided by the number of elements
# [1] 5.333333
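
One thing to watch with splitting on a single space: consecutive spaces produce empty strings, and those get counted as words. The sketch below uses base R's strsplit() rather than stringr so that it runs without extra packages; the example string is made up for illustration.

```r
# Made-up example string with a double space after "two"
x <- "two  spaces here"

# Split on a literal single space
single <- strsplit(x, " ", fixed = TRUE)[[1]]
single          # includes an empty string between "two" and "spaces"
length(single)  # 4, one more than the 3 actual words

# Split on runs of whitespace instead
runs <- strsplit(x, "[[:space:]]+")[[1]]
length(runs)    # 3
```

For clean input like the vector in the question, both approaches give the same counts.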

Answer 2 (score: 2)

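I'm sure there are more elaborate methods available, but you can use strsplit to split each string on whitespace into a character vector and then count its elements.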

mean(sapply(strsplit(as.character(mydf$two), "[[:space:]]+"), length))
# [1] 5.3333
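
If this is needed in more than one place, the one-liner could be wrapped in a small helper. This is only a sketch: avg_words is a hypothetical name, and trimming leading whitespace first is my own assumption, added because strsplit() returns a leading empty string when the input starts with whitespace. It relies on trimws() and lengths(), both available in base R since 3.2.0.

```r
# avg_words is a hypothetical helper name, not part of base R
avg_words <- function(x) {
  # as.character() also covers factor columns; trimws() drops leading and
  # trailing whitespace so no empty first piece is counted as a word
  x <- trimws(as.character(x))
  pieces <- strsplit(x, "[[:space:]]+")
  mean(lengths(pieces))
}

two <- c("this is a new york times article.",
         "short article.",
         "he went outside to smoke a cigarette.")
avg_words(two)  # (7 + 2 + 7) / 3
```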

Answer 3 (score: 2)

Using gregexpr() to count the spaces in each element and adding one:

mean(sapply(mydf$two, function(x) length(unlist(gregexpr(" ", x))) + 1))
# [1] 5.333333
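
Note that counting spaces breaks down for an element with no space at all: gregexpr() returns -1 when nothing matches, which still has length 1, so a single word is counted as two. A possible variation, sketched here with a hypothetical helper name count_words, is to match runs of non-whitespace characters and count only the real matches.

```r
# The space-counting expression applied to a single word
length(unlist(gregexpr(" ", "word"))) + 1  # 2, although there is only one word

# count_words is a hypothetical helper name, not a base R function
count_words <- function(x) {
  hits <- gregexpr("[^[:space:]]+", as.character(x))
  # gregexpr() reports -1 when nothing matches, so count only positive positions
  vapply(hits, function(pos) sum(pos > 0), integer(1))
}

count_words(c("word", "", "two words"))  # 1 0 2

two <- c("this is a new york times article.",
         "short article.",
         "he went outside to smoke a cigarette.")
mean(count_words(two))  # (7 + 2 + 7) / 3
```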