Split a data frame based on ngrams

Time: 2017-03-22 09:46:07

Tags: r dataframe split

I have a data frame with many terms (ngrams of different sizes, up to five-grams) and their respective frequencies:

df = data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
                         "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"), 
                freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))

Which gives us:

                 term freq
1                   a  131
2                 a a   13
3            a a card    3
4       a a card base    2
5    a a card base ne    1
6         a a divorce    1
7  a a divorce lawyer    1
8                  be   72
9              be the   17
10         be the one    5

What I want is to split the unigrams (terms with only one word), bigrams (terms with two words), trigrams, fourgrams and fivegrams into separate data frames:

For example, "df1", containing only the unigrams, would look like this:

                 term freq
1                   a  131
2                  be   72

"df2" (bigrams):

                 term freq
1                 a a   13
2              be the   17

"df3" (trigrams):

                 term freq
1            a a card    3
2         a a divorce    1
3          be the one    5

And so on. Any ideas? Maybe a regex?

1 answer:

Answer 0: (score: 6)

You can split on the number of whitespace runs in each term, i.e.

split(df, stringr::str_count(df$term, '\\s+'))

#$`0`
#  term freq
#1    a  131
#8   be   72

#$`1`
#    term freq
#2    a a   13
#9 be the   17

#$`2`
#          term freq
#3     a a card    3
#6  a a divorce    1
#10  be the one    5

#$`3`
#                term freq
#4      a a card base    2
#7 a a divorce lawyer    1

#$`4`
#              term freq
#5 a a card base ne    1
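
The list names above are whitespace counts, so a term with k runs of whitespace is a (k+1)-gram. If names like the "df1", "df2", ... from the question are wanted, one possible approach is to add 1 to the count and rename the list. This is a minimal sketch, written in base R so it runs without extra packages; `n_spaces` and `by_size` are illustrative names, not part of the original answer.

```r
# Example data from the question.
df <- data.frame(
  term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
           "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
  freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5),
  stringsAsFactors = FALSE
)

# Count whitespace runs per term. regmatches() returns character(0)
# when nothing matches, so single-word terms get a count of 0.
n_spaces <- lengths(regmatches(df$term, gregexpr("\\s+", df$term)))

# A term with k whitespace runs has k + 1 words.
by_size <- split(df, n_spaces + 1)

# Illustrative names "df1" ... "df5", following the question's wording.
names(by_size) <- paste0("df", names(by_size))

by_size$df1   # unigrams
by_size$df3   # trigrams
```

Keeping the pieces in one named list, rather than creating five separate variables, makes it easy to loop over them with `lapply()`.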

A base R only solution (as mentioned by @akrun):

split(df, lengths(gregexpr("\\S+", df$term)))
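
The base R version counts words rather than spaces, so the list names are the n-gram sizes directly, and stray leading or trailing whitespace should not shift a term into the wrong group. The sketch below illustrates the difference on made-up data (`messy` is not from the question; `n_words`, `n_spaces` and `groups` are illustrative names). One caveat: an empty string would still be counted as one word, because `gregexpr()` returns a single `-1` when there is no match.

```r
# Illustrative data, not from the question: note the stray whitespace.
messy <- data.frame(
  term = c("a", " be", "be the ", "a  a card"),
  freq = c(131, 72, 17, 3),
  stringsAsFactors = FALSE
)

# Counting runs of non-whitespace gives the word count directly.
n_words <- lengths(gregexpr("\\S+", messy$term))

# Counting whitespace runs is thrown off by leading/trailing blanks.
n_spaces <- lengths(regmatches(messy$term, gregexpr("\\s+", messy$term)))

groups <- split(messy, n_words)

# Optionally reset row names so each piece is numbered from 1,
# as in the desired output shown in the question.
groups <- lapply(groups, function(d) { rownames(d) <- NULL; d })
```

If the terms are known to be trimmed and single-spaced, as in the question's data, both counting approaches give the same grouping.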