Question

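My data frame has the columns Category and pd. I need to count how many times each meaningful word in pd appears in every Category. I am stuck on the last step: summarizing. Ideally, the ratio of each word's frequency to the total length of pd within a Category would be added as extra columns.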

Example:

freq = structure(list(Category = c("C1", "C2"),
                      pd = c("96 oz, epsom salt 96 oz, epsom bath salt",
                             "17 x 24 in, bath mat")),
                 .Names = c("Category", "pd"),
                 row.names = c(NA, -2L),
                 class = "data.frame")

pool = sort(unique(gsub("[[:punct:]]|[0-9]","", unlist(strsplit(freq[,2]," ")))))
pool = pool[nchar(pool)>1]

freq:

    Category    pd
1   C1  96 oz, epsom salt 96 oz, epsom bath salt
2   C2  17 x 24 in, bath mat

pool:

[1] "bath"  "epsom" "in"    "mat"   "oz"    "salt"

Desired output:

pool   C1freq  C1ratio  C2freq  C2ratio
bath   1       1/7      1       1/3
epsom  2       2/7      0       0
in     0       0        1       1/3
mat    0       0        1       1/3
oz     2       2/7      0       0
salt   2       2/7      0       0

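Here 7 is the number of words in the pd value for C1 after digits and punctuation are removed and one-character tokens are dropped (the same rule used to build pool); for C2 the count is 3. The ratio does not need to be shown as a fraction like 1/7. That form is used here only to make the denominator visible.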
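To make the denominators concrete, here is a minimal base R sketch that applies the pool rule to each row separately and counts the surviving tokens. The names `tokens` and `denominators` are introduced only for illustration and do not appear in the original post.

```r
# Example data from the question
freq <- data.frame(
  Category = c("C1", "C2"),
  pd = c("96 oz, epsom salt 96 oz, epsom bath salt",
         "17 x 24 in, bath mat"),
  stringsAsFactors = FALSE
)

# Apply the pool rule row by row: remove punctuation and digits,
# split on whitespace, keep tokens longer than one character
tokens <- lapply(
  strsplit(gsub("[[:punct:]]|[0-9]", "", freq$pd), "\\s+"),
  function(x) x[nchar(x) > 1]
)
names(tokens) <- freq$Category

# Number of remaining tokens per category (the ratio denominators)
denominators <- lengths(tokens)
print(denominators)
```

With the example data this should give 7 for C1 and 3 for C2, matching the fractions in the desired output.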

If possible, without dplyr or qdap. Thanks!!

Answer 1

We can try:

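library(qdapTools)
library(stringr)
# extract runs of two or more letters from each pd value
lst <- str_extract_all(freq$pd, '[A-Za-z]{2,}')
# word counts: one row per word, one column per row of freq
m1 <- t(mtabulate(lst))
# column-wise proportions
m2 <-  prop.table(m1,2)
# interleave the count and proportion columns
cbind(m1, m2)[,c(1,3,2,4)]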

Or, without qdapTools (reusing lst from above):

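# all distinct words, sorted
Un1 <- sort(unique(unlist(lst)))
# count every word in each element of lst, including zero counts
m1 <- do.call(cbind, lapply(lst, function(x)
            table(factor(x, levels=Un1))))
colnames(m1) <- freq$Category
# append the proportions, with 'Prop' added to their column names
cbind(m1, `colnames<-`(prop.table(m1,2), paste0(colnames(m1), 'Prop')))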
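If stringr should be avoided as well, the extraction step can probably be done in base R with `regmatches()` and `gregexpr()`. The sketch below is my own substitution for `str_extract_all()`, not part of the original answer; the rest follows the same tabulation as above.

```r
# Example data from the question
freq <- data.frame(
  Category = c("C1", "C2"),
  pd = c("96 oz, epsom salt 96 oz, epsom bath salt",
         "17 x 24 in, bath mat"),
  stringsAsFactors = FALSE
)

# Base R stand-in for str_extract_all(): runs of two or more letters
lst <- regmatches(freq$pd, gregexpr("[A-Za-z]{2,}", freq$pd))

# Same tabulation as in the answer: shared factor levels keep zero counts
Un1 <- sort(unique(unlist(lst)))
m1 <- do.call(cbind, lapply(lst, function(x) table(factor(x, levels = Un1))))
colnames(m1) <- freq$Category

# Counts next to column-wise proportions
res <- cbind(m1, `colnames<-`(prop.table(m1, 2), paste0(colnames(m1), "Prop")))
print(res)
```

The column sums of `m1` are the per-category word totals, so `prop.table(m1, 2)` divides each count by the denominator described in the question.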

Answer 2

You could consider adapting your current approach as follows:

tab <- table(
  stack(
    setNames(
      lapply(strsplit(gsub("[[:punct:]]|[0-9]", "", freq$pd), "\\s+"), 
             function(x) x[nchar(x) > 1]), freq$Category)))

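Note that I used gsub first, rather than after splitting. Then I split on whitespace and filtered the data the same way you filtered it. Finally, I used setNames so that stack could produce a long data.frame that can be tabulated.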

Once the data are tabulated, just use prop.table to get your desired output:

cbind(tab, prop.table(tab, 2))
#       C1 C2        C1        C2
# bath   1  1 0.1428571 0.3333333
# epsom  2  0 0.2857143 0.0000000
# in     0  1 0.0000000 0.3333333
# mat    0  1 0.0000000 0.3333333
# oz     2  0 0.2857143 0.0000000
# salt   2  0 0.2857143 0.0000000
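The question's desired layout interleaves the columns as C1freq, C1ratio, C2freq, C2ratio. A minimal sketch of one way to get there from the table is shown here. The names `out`, `ratio`, and the loop variable `ct` are hypothetical and introduced only for this example.

```r
# Example data from the question
freq <- data.frame(
  Category = c("C1", "C2"),
  pd = c("96 oz, epsom salt 96 oz, epsom bath salt",
         "17 x 24 in, bath mat"),
  stringsAsFactors = FALSE
)

# Tabulate words by category, as in the answer above
tab <- table(
  stack(
    setNames(
      lapply(strsplit(gsub("[[:punct:]]|[0-9]", "", freq$pd), "\\s+"),
             function(x) x[nchar(x) > 1]), freq$Category)))
ratio <- prop.table(tab, 2)

# Build one freq column and one ratio column per category, interleaved
out <- data.frame(pool = rownames(tab), stringsAsFactors = FALSE)
for (ct in colnames(tab)) {
  out[[paste0(ct, "freq")]]  <- as.vector(tab[, ct])
  out[[paste0(ct, "ratio")]] <- as.vector(ratio[, ct])
}
print(out)
```

The loop keeps each category's count next to its ratio, so it should extend to more than two categories without hard-coding column positions.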


R: Counting occurrences of words from a list in a data frame
