Summarise from string matches

Time: 2019-02-20 13:38:27

Tags: r string summary

I have this df column:

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld", "asdthreeasp", "asdfetwoasd", "fouroqwke","okasdtwo", "acmofour", "porefour", "okstwo"))
> df
          Strings
1  ñlas onepojasd
2       onenañdsl
3     ñelrtwofkld
4     asdthreeasp
5     asdfetwoasd
6       fouroqwke
7        okasdtwo
8        acmofour
9        porefour
10         okstwo

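I know that every value in df$Strings will match one of the words one, two, three or four. I also know that each value will match only one of those words. So, to match them (str_detect is from the stringr package):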

str_detect(df$Strings,"one")
str_detect(df$Strings,"two")
str_detect(df$Strings,"three")
str_detect(df$Strings,"four")

But I am stuck here, because what I am trying to create is this table:

Homes  Quantity Percent
  One         2     0.2
  Two         4     0.4
Three         1     0.1
 Four         3     0.3
Total        10       1

3 answers:

Answer 0 (score: 2)

With tidyverse and janitor, you can do the following:

library(tidyverse)
library(janitor)

df %>%
 mutate(Homes = str_extract(Strings, "one|two|three|four"),
        n = n()) %>%
 group_by(Homes) %>%
 summarise(Quantity = length(Homes),
           Percent = first(length(Homes)/n)) %>%
 adorn_totals("row")

 Homes Quantity Percent
  four        3     0.3
   one        2     0.2
 three        1     0.1
   two        4     0.4
 Total       10     1.0

Or with tidyverse only:

df %>%
 mutate(Homes = str_extract(Strings, "one|two|three|four"),
        n = n()) %>%
 group_by(Homes) %>%
 summarise(Quantity = length(Homes),
           Percent = first(length(Homes)/n)) %>%
 rbind(., data.frame(Homes = "Total", Quantity = sum(.$Quantity), 
                     Percent = sum(.$Percent)))

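In both cases, the code first extracts the matched pattern and stores the total number of cases in n. Second, it groups by the matched word. Third, it computes the number of cases for each word and that word's share of all cases. Finally, it adds a "Total" row.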
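
The same steps can also be written with count(), which groups and tallies in one call. This is only a minimal sketch, not the answer author's code: it assumes dplyr 0.8.0 or later (for the name argument of count()) and that every string matches exactly one of the four words, as the question states. The names res, total and out are made up for illustration.

```r
library(dplyr)
library(stringr)

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"),
                 stringsAsFactors = FALSE)

# Extract the matched word, then count rows per word and compute each share
res <- df %>%
  mutate(Homes = str_extract(Strings, "one|two|three|four")) %>%
  count(Homes, name = "Quantity") %>%
  mutate(Percent = Quantity / sum(Quantity))

# Build the "Total" row separately and append it (hypothetical helper object)
total <- data.frame(Homes = "Total",
                    Quantity = sum(res$Quantity),
                    Percent = sum(res$Percent),
                    stringsAsFactors = FALSE)

out <- bind_rows(res, total)
print(out)
```

Computing Percent after count() avoids carrying the helper column n through the pipeline.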

Answer 1 (score: 1)

You can use str_extract and then apply table and prop.table, i.e.

library(stringr)

str_extract(df$Strings, 'one|two|three|four')
#[1] "one"   "one"   "two"   "three" "two"   "four"  "two"   "four"  "four"  "two"  

table(str_extract(df$Strings, 'one|two|three|four'))
# four   one three   two 
#    3     2     1     4 

prop.table(table(str_extract(df$Strings, 'one|two|three|four')))
# four   one three   two 
#  0.3   0.2   0.1   0.4 
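
To get from these two tables to the data frame the question asks for, the counts and proportions can be combined by hand. This is a rough sketch under the question's assumption that every string matches exactly one word; the object names homes, counts and res are illustrative and not part of the original answer.

```r
library(stringr)

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"),
                 stringsAsFactors = FALSE)

homes  <- str_extract(df$Strings, "one|two|three|four")
counts <- table(homes)

# Put the counts and the proportions side by side in one data frame
res <- data.frame(Homes = names(counts),
                  Quantity = as.vector(counts),
                  Percent = as.vector(prop.table(counts)),
                  stringsAsFactors = FALSE)

# Append the "Total" row
res <- rbind(res,
             data.frame(Homes = "Total",
                        Quantity = sum(res$Quantity),
                        Percent = sum(res$Percent),
                        stringsAsFactors = FALSE))
print(res)
```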

Answer 2 (score: 0)

A base R option would be regmatches/regexpr with table:

table(regmatches(df$Strings, regexpr('one|two|three|four', df$Strings)))
#  four   one three   two 
#    3     2     1     4 

Add addmargins to get the Sum, then divide by it:

out <- addmargins(table(regmatches(df$Strings, 
     regexpr('one|two|three|four', df$Strings))))
out/out[length(out)]

# four   one three   two   Sum 
#  0.3   0.2   0.1   0.4   1.0
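
The outputs above come out in alphabetical order (four, one, three, two), while the table in the question lists the words as one, two, three, four. One way to control the order in base R is to convert the matches to a factor with explicit levels before calling table. This is a sketch only; the vector words and the other object names are assumptions introduced for illustration.

```r
df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"),
                 stringsAsFactors = FALSE)

# Desired row order, also used to build the regular expression
words <- c("one", "two", "three", "four")
m <- regmatches(df$Strings, regexpr(paste(words, collapse = "|"), df$Strings))

# A factor with explicit levels makes table() keep that order
tab <- table(factor(m, levels = words))
out <- addmargins(tab)

res <- data.frame(Homes = names(out),
                  Quantity = as.vector(out),
                  Percent = as.vector(out / sum(tab)),
                  stringsAsFactors = FALSE)
print(res)
```

Note that regmatches drops strings with no match, so the totals only add up to nrow(df) when every string contains one of the words.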