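How do I count the frequency of characters in strings?

Time: 2016-01-18 06:36:08

Tags: r string frequency

My data.frame contains information about movements completed by individuals, along with the strings (alphabetic characters) that represent those movements in a database. Its structure is as follows: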

MovementAnalysis <- structure(list(Strings = c("AaB", "cZhH", "Bb", "bAc"), Descriptor = c("Jog/ Stop/ Turn", "Change/ Shuffle/ Backwards/ Jump", "Turn/ Duck", "Duck/ Jog/ Change"), Person = c("Sally", "Sally", "Ben", "Ben")), .Names = c("Strings", "Descriptor", "Person"), row.names = c(NA, 4L), class = "data.frame")

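For each Person, I would like to capture the frequency of each alphabetic letter (for example: A, a, B, b) across all of that person's Strings. There are 48 alpha upper- and lowercase letters. My actual data.frame contains the movements of 100+ people, so a fast solution that iterates over each person would be ideal. As an example, my expected output is: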

Output <- structure(list(Person = c("Sally", "Sally", "Sally", "Sally", "Ben", "Ben", "Ben", "Ben"), Letter = c("A", "a", "B", "b", "A", "a", "B", "b"), Frequency = c(1, 1, 1, 0, 1, 0, 1, 2)), .Names = c("Person", "Letter", "Frequency"), row.names = c(NA, 8L), class = "data.frame")

Thanks!

3 answers:

Answer 0: (score: 1)

One option is to use data.table:

library(data.table)
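# MovementAnalysis is the data.frame from the question; as.data.table() works on a copy
df2 <- as.data.table(MovementAnalysis)[, list(Letter = {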
   tmp <- unlist(strsplit(Strings, ''))
   factor(tmp[tmp %in% c("A", "a", "B", "b")], 
        levels=c("A", "a", "B", "b"))}) , Person]
df2[, ind:="Frequency"]
dcast(df2, Person+Letter~ind, value.var="Letter", length, drop=FALSE)
#   Person Letter Frequency
#1:    Ben      A         1
#2:    Ben      a         0
#3:    Ben      B         1
#4:    Ben      b         2
#5:  Sally      A         1
#6:  Sally      a         1
#7:  Sally      B         1
#8:  Sally      b         0
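The zero rows in the output above appear because the factor levels are fixed before counting. As a rough illustration of that idea only, here is a minimal base R sketch. It assumes the letters of interest are just "A", "a", "B" and "b", and the names `wanted`, `long` and `counts` are made up for this example.

```r
# Data from the question (Descriptor column left out for brevity)
MovementAnalysis <- data.frame(
  Strings = c("AaB", "cZhH", "Bb", "bAc"),
  Person  = c("Sally", "Sally", "Ben", "Ben"),
  stringsAsFactors = FALSE
)

# Assumption: only these four letters are of interest
wanted <- c("A", "a", "B", "b")

# Split every string into single characters, one list element per row
chars_per_row <- strsplit(MovementAnalysis$Strings, "")

# One row per character; letters outside `wanted` become NA in the factor
long <- data.frame(
  Person = rep(MovementAnalysis$Person, lengths(chars_per_row)),
  Letter = factor(unlist(chars_per_row), levels = wanted)
)

# table() ignores NA by default and keeps unused factor levels,
# so letters a person never used are reported with a count of 0
counts <- as.data.frame(
  table(Person = long$Person, Letter = long$Letter),
  responseName = "Frequency"
)
print(counts)
```

Passing all 48 letters as `wanted` should give one row per Person and letter in the same way.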

Answer 1: (score: 1)

Less elegant than akrun's answer, but I think it works:

your.func <- function(data) {
    require(dplyr)
    bag.of.letters <- function(strings) {
        concat.string <- paste(strings, collapse='')
        all.chars.vec <- unlist(strsplit(concat.string,""))
        result <- data.frame(table(factor(all.chars.vec,levels = c(letters,LETTERS))))
        colnames(result) <- c("Letter","Frequency")
        result[order(result[["Letter"]]),]
    }
    lapply(X = unique(data[["Person"]]), 
           FUN = function(n) {
               strings = data %>% filter(Person == n) %>% .[["Strings"]]
               data.frame(Person = n, bag.of.letters(strings))
           }) %>% do.call(rbind,.)
}

your.func(MovementAnalysis)

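If you want the Letter column to contain only letters with a positive frequency, remove the factor(..., levels = c(letters,LETTERS)) part, so that table() is called on all.chars.vec directly.

Answer 2: (score: 0)

Here is an option using cSplit_e from my "splitstackshape" package. I've combined it with "magrittr" so that you can go through the steps without having to store any intermediate objects or create long nested expressions.

The first option shows how to get the "wide" form, as mentioned by @alistaire.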
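As a minimal sketch of that variant, assuming base R only: `count_present_letters` is a hypothetical helper name, not part of the answer's code. Calling table() on a plain character vector tabulates only the values that actually occur.

```r
# Data from the question (Descriptor column left out for brevity)
MovementAnalysis <- data.frame(
  Strings = c("AaB", "cZhH", "Bb", "bAc"),
  Person  = c("Sally", "Sally", "Ben", "Ben"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count only the letters that occur in `strings`
count_present_letters <- function(strings) {
  chars <- unlist(strsplit(paste(strings, collapse = ""), ""))
  # No fixed factor levels here, so absent letters get no row at all
  counts <- table(chars)
  data.frame(
    Letter = names(counts),
    Frequency = as.vector(counts),
    stringsAsFactors = FALSE
  )
}

ben <- count_present_letters(
  MovementAnalysis$Strings[MovementAnalysis$Person == "Ben"]
)
print(ben)
```

The row order of the result depends on the locale's sort order for upper- and lowercase letters, so it may differ between machines.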

library(splitstackshape)
library(magrittr)

data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
  cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
  .[, lapply(.SD, sum), by = Person] %>%
  subset(select = grep("Person|_[AaBb]$", names(.)))
#    Person Strings_a Strings_A Strings_b Strings_B
# 1:  Sally         1         1         0         1
# 2:    Ben         0         1         2         1

To go from the above to the long form, you just need to add a melt line.

data.table(subset(MovementAnalysis, select = -Descriptor)) %>%
  cSplit_e("Strings", "", type = "character", drop = TRUE, fill = 0) %>%
  .[, lapply(.SD, sum), by = Person] %>%
  subset(select = grep("Person|_[AaBb]$", names(.))) %>%
  melt(id.vars = "Person")
#    Person  variable value
# 1:  Sally Strings_a     1
# 2:    Ben Strings_a     0
# 3:  Sally Strings_A     1
# 4:    Ben Strings_A     1
# 5:  Sally Strings_b     0
# 6:    Ben Strings_b     2
# 7:  Sally Strings_B     1
# 8:    Ben Strings_B     1  
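The long output above still carries the "Strings_" prefix in its variable column, whereas the question asks for a plain Letter column. One possible way to do the reshaping and the renaming together is sketched below in base R. The `wide` data frame is typed in by hand from the wide output shown earlier, and the names `value_cols` and `long` are assumptions for this example.

```r
# Wide result copied by hand from the output shown earlier
wide <- data.frame(
  Person    = c("Sally", "Ben"),
  Strings_a = c(1, 0),
  Strings_A = c(1, 1),
  Strings_b = c(0, 2),
  Strings_B = c(1, 1),
  stringsAsFactors = FALSE
)

# Every column except Person holds counts for one letter
value_cols <- setdiff(names(wide), "Person")

# Stack the count columns one after another; strip the "Strings_" prefix
long <- data.frame(
  Person    = rep(wide$Person, times = length(value_cols)),
  Letter    = rep(sub("^Strings_", "", value_cols), each = nrow(wide)),
  Frequency = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```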

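It isn't clear from your question, but if you restricted the data to "A", "a", "B" and "b" only for the purposes of illustration, and you are really interested in the full 48 options, then you can also leave out the following line: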

subset(select = grep("Person|_[AaBb]$", names(.)))