I want to transform data that is given in the following format:
colA                     colB
textA textB textC textD  m
textA textB              n
textB textC              p
textB textC textD        q
Desired output:
type      col_a                    col_b (frequency)  col_c
unigram   textA                    2                  m+n
unigram   textB                    4                  m+n+p+q
unigram   textC                    3                  m+p+q
unigram   textD                    2                  m+q
bigram    textA textB              2                  m+n
bigram    textB textC              3                  m+p+q
bigram    textC textD              2                  m+q
trigram   textA textB textC        1                  m
trigram   textB textC textD        2                  m+q
fourgram  textA textB textC textD  1                  m
This needs to be done separately for unigrams, bigrams, trigrams and fourgrams, and the results then combined with rbind.
Answer 0 (score: 3)
Here is one idea:
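n_grams <- function(n, data = df) {
  # strsplit() needs character input, so convert in case colA is a factor
  colA <- as.character(data$colA)
  unigrams1 <- unique(unlist(strsplit(colA, ' ', fixed = TRUE)))
  # candidate n-grams: combinations of the unique tokens, in order of first appearance
  grams <- apply(combn(unigrams1, n), 2, paste, collapse = ' ')
  # for each candidate, collect the colB labels of the rows that contain it
  t1 <- sapply(grams, function(i) paste(data$colB[grepl(i, colA, fixed = TRUE)], collapse = '+'))
  # drop candidates that never occur in colA
  return(t1[nchar(t1) > 0])
}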
# Testing the function
n_grams(1)
#     textA     textB     textC     textD
#     "m+n" "m+n+p+q"   "m+p+q"     "m+q"
n_grams(2)
# textA textB textB textC textC textD
#       "m+n"     "m+p+q"       "m+q"
n_grams(3)
# textA textB textC textB textC textD
#               "m"             "m+q"
n_grams(4)
# textA textB textC textD
#                     "m"
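One caveat: `combn` builds its candidates from the unique tokens in order of first appearance, so it only finds n-grams whose words happen to follow that order. A minimal sketch of an alternative, assuming the same `colA`/`colB` layout as in the question, takes the contiguous n-grams from each row directly. The helper names `row_ngrams` and `n_grams_contig` are hypothetical and not part of the original answer, and the example data is rebuilt by hand from the question.

```r
# Example data, rebuilt by hand from the question.
df <- data.frame(
  colA = c("textA textB textC textD", "textA textB", "textB textC", "textB textC textD"),
  colB = c("m", "n", "p", "q"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all distinct contiguous n-grams of one space-separated string.
row_ngrams <- function(x, n) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  unique(vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
                character(1)))
}

# Hypothetical helper: for each n-gram, join the colB labels of the rows containing it.
n_grams_contig <- function(n, data = df) {
  grams <- lapply(as.character(data$colA), row_ngrams, n = n)
  # repeat each row's label once per n-gram found in that row
  labels <- rep(as.character(data$colB), lengths(grams))
  vapply(split(labels, unlist(grams)), paste, character(1), collapse = "+")
}

n_grams_contig(2)
# "textA textB" -> "m+n", "textB textC" -> "m+p+q", "textC textD" -> "m+q"
```

The result is a named character vector like the one returned by `n_grams`, with names sorted alphabetically by `split`.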
Then, to build the desired output:
df1 <- data.frame(rbind(stack(n_grams(1)), stack(n_grams(2)), stack(n_grams(3)), stack(n_grams(4))))
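# count labels by splitting on '+', so multi-character colB values are handled too
df1$freq <- lengths(strsplit(as.character(df1$values), '+', fixed = TRUE))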
df1 <- df1[,c('ind', 'freq', 'values')]
df1
#                        ind freq  values
# 1                    textA    2     m+n
# 2                    textB    4 m+n+p+q
# 3                    textC    3   m+p+q
# 4                    textD    2     m+q
# 5              textA textB    2     m+n
# 6              textB textC    3   m+p+q
# 7              textC textD    2     m+q
# 8        textA textB textC    1       m
# 9        textB textC textD    2     m+q
# 10 textA textB textC textD    1       m
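The question also asks for a `type` column, which the result above does not have. A minimal sketch of one way to add it, assuming a data frame shaped like `df1`: the small `res` data frame is rebuilt by hand from a few rows of the printed output, and the `type_names` lookup is a hypothetical helper that only covers n-grams of up to four words.

```r
# A few rows shaped like df1 above, rebuilt by hand for illustration.
res <- data.frame(
  ind = c("textA", "textA textB", "textB textC textD", "textA textB textC textD"),
  freq = c(2L, 2L, 2L, 1L),
  values = c("m+n", "m+n", "m+q", "m"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: number of words in the n-gram -> type label.
type_names <- c("unigram", "bigram", "trigram", "fourgram")
# as.character() guards against ind being a factor, as it is after stack()
n_words <- lengths(strsplit(as.character(res$ind), " ", fixed = TRUE))
res$type <- type_names[n_words]

# Put the columns in the order the question asks for.
res <- res[, c("type", "ind", "freq", "values")]
res
```

Deriving the type from the word count avoids having to tag each `n_grams(n)` result separately before the `rbind`.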