Question

I have a data frame with many terms (ngrams of different sizes, up to five-grams) and their respective frequencies:

df = data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
                         "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"), 
                freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))

Which gives us:

                 term freq
1                   a  131
2                 a a   13
3            a a card    3
4       a a card base    2
5    a a card base ne    1
6         a a divorce    1
7  a a divorce lawyer    1
8                  be   72
9              be the   17
10         be the one    5

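What I want is to split the unigrams (terms with only one word), bigrams (terms with two words), trigrams, fourgrams and fivegrams into separate data frames.

For example, "df1", containing only the unigrams, would look like this: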

                 term freq
1                   a  131
2                  be   72

"df2" (bigrams):

                 term freq
1                 a a   13
2              be the   17

"df3" (trigrams):

                 term freq
1            a a card    3
2         a a divorce    1
3          be the one    5

And so on. Any ideas? Maybe a regular expression?

Answer 1

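You can split by the number of whitespace runs in each term (an n-gram contains n - 1 of them, so the groups are named 0 to 4), i.e.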

split(df, stringr::str_count(df$term, '\\s+'))

#$`0`
#  term freq
#1    a  131
#8   be   72

#$`1`
#    term freq
#2    a a   13
#9 be the   17

#$`2`
#          term freq
#3     a a card    3
#6  a a divorce    1
#10  be the one    5

#$`3`
#                term freq
#4      a a card base    2
#7 a a divorce lawyer    1

#$`4`
#              term freq
#5 a a card base ne    1

A base R only solution (as @akrun mentioned) counts the words directly, so the groups are named 1 to 5:

split(df, lengths(gregexpr("\\S+", df$term)))
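To get from the split list to the "df1", "df2", ... layout shown in the question, something like the following might work. This is a minimal sketch built on the base R approach above; the names `n_words` and `by_n` are made up for illustration and are not part of either answer.

```r
df <- data.frame(
  term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
           "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
  freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5),
  stringsAsFactors = FALSE  # keep 'term' as character on R < 4.0 as well
)

# Number of whitespace-separated words in each term (base R only).
n_words <- lengths(gregexpr("\\S+", df$term))

# One data frame per n-gram size; split() names the groups "1" ... "5".
by_n <- split(df, n_words)

# Rename the groups to df1 ... df5 and renumber the rows within each group,
# so that the output matches the layout requested in the question.
names(by_n) <- paste0("df", names(by_n))
by_n <- lapply(by_n, function(d) {
  rownames(d) <- NULL
  d
})

by_n$df1  # unigrams
by_n$df3  # trigrams
```

Keeping the pieces in one named list is usually easier to work with than five separate variables; if separate variables are really needed, `list2env(by_n, envir = environment())` would create them in the current environment. Note that an empty string would be counted as one word by this approach, because `gregexpr()` returns a single -1 when nothing matches.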
