My data frame contains many terms (n-grams of different sizes, up to five-grams) and their respective frequencies:
df = data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
"a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))
Which gives us:
term freq
1 a 131
2 a a 13
3 a a card 3
4 a a card base 2
5 a a card base ne 1
6 a a divorce 1
7 a a divorce lawyer 1
8 be 72
9 be the 17
10 be the one 5
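What I want is to split the unigrams (terms with only one word), bigrams (terms with two words), trigrams, four-grams and five-grams into separate data frames.
For example, "df1", containing only the unigrams, would look like this: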
term freq
1 a 131
2 be 72
"df2" (bigrams):
term freq
1 a a 13
2 be the 17
"df3" (trigrams):
term freq
1 a a card 3
2 a a divorce 1
3 be the one 5
And so on. Any ideas? Maybe a regular expression?
Answer 0 (score: 6)
You can split by the number of spaces in each term, i.e.
split(df, stringr::str_count(df$term, '\\s+'))
#$`0`
# term freq
#1 a 131
#8 be 72
#$`1`
# term freq
#2 a a 13
#9 be the 17
#$`2`
# term freq
#3 a a card 3
#6 a a divorce 1
#10 be the one 5
#$`3`
# term freq
#4 a a card base 2
#7 a a divorce lawyer 1
#$`4`
# term freq
#5 a a card base ne 1
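Note that the list names above are whitespace counts, so the n-gram size is the name plus one. As a minimal sketch of the same idea without stringr, the count can be reproduced in base R with `regmatches()` and `gregexpr()`. The names `term_clean`, `n_spaces` and `by_size` are made up for this illustration, and the `trimws()` call is an added assumption to guard against stray leading or trailing spaces, which would otherwise inflate the count:

```r
df <- data.frame(
  term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
           "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
  freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5),
  stringsAsFactors = FALSE
)

# Strip leading/trailing whitespace so it is not counted as a separator
# (assumption: the real data may contain such padding).
term_clean <- trimws(df$term)

# Count runs of whitespace per term. regmatches() returns an empty vector
# for terms with no match, so unigrams get a count of 0.
n_spaces <- lengths(regmatches(term_clean, gregexpr("\\s+", term_clean)))

# Split on n-gram size (spaces + 1), so the list names are "1" to "5".
by_size <- split(df, n_spaces + 1)
names(by_size)
by_size[["3"]]  # the trigrams
```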
A base R-only solution (as @akrun mentioned), which counts the words in each term instead:
split(df, lengths(gregexpr("\\S+", df$term)))
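Because this version counts words rather than spaces, the list names are already the n-gram sizes ("1" to "5"). To get the separately named data frames asked for in the question, one possible follow-up is sketched below. The name `pieces` is made up for this illustration, and resetting the row names is an optional step chosen to match the desired output shown in the question. One caveat: an empty string would be counted as one word here, because `gregexpr()` returns a single `-1` when nothing matches.

```r
df <- data.frame(
  term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
           "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
  freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5),
  stringsAsFactors = FALSE
)

# Split on the number of words (runs of non-whitespace) per term.
pieces <- split(df, lengths(gregexpr("\\S+", df$term)))

# Rename the list elements "1".."5" to "df1".."df5".
names(pieces) <- paste0("df", names(pieces))

# Optional: renumber the rows of each piece from 1.
pieces <- lapply(pieces, function(x) {
  rownames(x) <- NULL
  x
})

pieces$df3

# Optional: create df1 ... df5 as separate objects in the global environment.
list2env(pieces, envir = .GlobalEnv)
```

Keeping the pieces in a list is usually easier to work with (for example with `lapply()`), so the `list2env()` step is only needed if separate objects are really required.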