Turning a set of strings into a group-ordered data frame that keeps sequence information

Time: 2018-11-14 03:26:06

Tags: r

I am struggling to build an organized data frame from a set of strings.

Given this input:

text = c('I do not want to do this thing anymore','you do not know what I mean','I will not do this thing','do not want anymore','you will see')

[1] "I do not want to do this thing anymore" "you do not know what I mean"           
[3] "I will not do this thing"               "do not want anymore"                   
[5] "you will see"  

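I want to produce a data frame that looks like a document-term table but also carries sequence information. However, I do not know how to achieve this. What I am after is neither a document-term matrix nor the data frame that the following code creates.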

library(stringi)  # provides stri_list2matrix
as.data.frame(t(stri_list2matrix(strsplit(as.character(text),' '))))

   V1   V2   V3      V4   V5    V6   V7    V8      V9
1   I   do  not    want   to    do this thing anymore
2 you   do  not    know what     I mean  <NA>    <NA>
3   I will  not      do this thing <NA>  <NA>    <NA>
4  do  not want anymore <NA>  <NA> <NA>  <NA>    <NA>
5 you will  see    <NA> <NA>  <NA> <NA>  <NA>    <NA>
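For contrast, here is a minimal base-R sketch of a plain document-term matrix for the same input. It only counts words per sentence, so it shows why that representation cannot carry sequence information. The names `words`, `vocab`, and `dtm` are illustrative and not part of the original post.

```r
text <- c('I do not want to do this thing anymore', 'you do not know what I mean',
          'I will not do this thing', 'do not want anymore', 'you will see')

# Split each sentence into words
words <- strsplit(text, " ", fixed = TRUE)
# Vocabulary: unique words in order of first appearance
vocab <- unique(unlist(words))

# One row per sentence, one column per vocabulary word, cells hold counts
dtm <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
dtm
```

The first sentence contains "do" twice, which the matrix records as a count of 2 with no indication of where each occurrence sits in the sentence.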

What I intend to produce is this:

    V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11   V12     V13  V14  V15  V16
1 <NA>    I   do <NA>  not <NA> <NA> want   to   do this thing anymore <NA> <NA> <NA>
2  you <NA>   do <NA>  not <NA> know <NA> <NA> <NA> <NA>  <NA>    <NA> what    I mean
3 <NA>    I <NA> will  not <NA> <NA> <NA> <NA>   do this thing    <NA> <NA> <NA> <NA>
4 <NA> <NA>   do <NA>  not <NA> <NA> want <NA> <NA> <NA>  <NA> anymore <NA> <NA> <NA>
5  you <NA> <NA> will <NA>  see <NA> <NA> <NA> <NA> <NA>  <NA>    <NA> <NA> <NA> <NA>

result = data.frame(V1=c(NA,"you",NA,NA,"you"),
                    V2=c("I",NA,"I",NA,NA),
                    V3=c("do","do",NA,"do",NA),
                    V4=c(NA,NA,"will",NA,"will"),
                    V5=c("not","not","not","not",NA),
                    V6=c(NA,NA,NA,NA,"see"),
                    V7=c(NA,"know",NA,NA,NA),
                    V8=c("want",NA,NA,"want",NA),
                    V9=c("to",NA,NA,NA,NA),
                    V10=c("do",NA,"do",NA,NA),
                    V11=c("this",NA,"this",NA,NA),
                    V12=c("thing",NA,"thing",NA,NA),
                    V13=c("anymore",NA,NA,"anymore",NA),
                    V14=c(NA,"what",NA,NA,NA),
                    V15=c(NA,"I",NA,NA,NA),
                    V16=c(NA,"mean",NA,NA,NA))

That way I can restore the original list of strings.

origin = do.call(paste, c(result, sep=" "))
origin = gsub('( NA|NA\\s*)','',origin)
origin

[1] "I do not want to do this thing anymore" "you do not know what I mean"           
[3] "I will not do this thing"               "do not want anymore"                   
[5] "you will see"  
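One caveat with the `gsub` pattern above: it removes every literal "NA", so a word such as "NASA" would come back as "SA". A minimal sketch that drops missing cells before pasting avoids this. The helper name `restore_rows` and the two-row sample data frame are assumptions made for illustration only.

```r
# A small stand-in for the first columns of `result` (illustrative data)
result <- data.frame(V1 = c(NA, "you"),
                     V2 = c("I", NA),
                     V3 = c("do", "do"),
                     V4 = c("not", "not"),
                     stringsAsFactors = FALSE)

# Paste the non-missing cells of each row back into one string
restore_rows <- function(df) {
  unname(apply(df, 1, function(r) paste(r[!is.na(r)], collapse = " ")))
}

origin <- restore_rows(result)
origin
```

Because missing cells are filtered with `is.na()` rather than matched as text, words that happen to contain "NA" are left untouched.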

1 answer:

Answer 0: (score: 0)

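Please see the code below and let me know whether it serves your purpose. The only difference is that the word order in the output data frame is not the same as yours.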

library(stringi)
text = c('I do not want to do this thing anymore','you do not know what I mean','I will not do this thing','do not want anymore','you will see')

# One row per sentence, one column per word position, padded with NA
tf = as.data.frame(t(stri_list2matrix(strsplit(text,' '))),stringsAsFactors = FALSE)
# Every word of every sentence, in order of appearance
strs = unlist(strsplit(text,' '))

# Vocabulary: unique words in order of first appearance
fstrs = unique(strs)

# For each sentence, keep a vocabulary word if the sentence contains it, else NA
log_out = data.frame()
for(i in seq_len(nrow(tf))){
  present = fstrs %in% as.character(tf[i,])
  row = as.data.frame(t(ifelse(present,fstrs,NA)),stringsAsFactors = FALSE)
  log_out = rbind(log_out,row)
}

The output will be:

log_out
    V1   V2   V3   V4   V5   V6    V7      V8   V9  V10  V11  V12  V13  V14
1    I   do  not want   to this thing anymore <NA> <NA> <NA> <NA> <NA> <NA>
2    I   do  not <NA> <NA> <NA>  <NA>    <NA>  you know what mean <NA> <NA>
3    I   do  not <NA> <NA> this thing    <NA> <NA> <NA> <NA> <NA> will <NA>
4 <NA>   do  not want <NA> <NA>  <NA> anymore <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA> <NA>  <NA>    <NA>  you <NA> <NA> <NA> will  see
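Note that `log_out` has one column per unique word, so the second "do" of the first sentence is dropped and the second row reads "I do not you know what mean" when its NAs are removed. The original strings cannot be restored from it. One possible way to keep the order is to build a column sequence that contains every sentence as a subsequence, by merging the sentences one at a time along their longest common subsequence. The sketch below does that; the function names `lcs_table`, `merge_seq`, and `align_sentences` are hypothetical, and the resulting column order is not guaranteed to match the hand-built `result` from the question.

```r
# Longest-common-subsequence (LCS) length table for character vectors a and b
lcs_table <- function(a, b) {
  m <- matrix(0L, length(a) + 1L, length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      if (a[i] == b[j]) {
        m[i + 1L, j + 1L] <- m[i, j] + 1L
      } else {
        m[i + 1L, j + 1L] <- max(m[i, j + 1L], m[i + 1L, j])
      }
    }
  }
  m
}

# Merge b into a: the result contains both a and b as subsequences
merge_seq <- function(a, b) {
  m <- lcs_table(a, b)
  i <- length(a)
  j <- length(b)
  out <- character(0)
  # Walk the table backwards, sharing matched words and keeping the rest
  while (i > 0L || j > 0L) {
    if (i > 0L && j > 0L && a[i] == b[j]) {
      out <- c(a[i], out); i <- i - 1L; j <- j - 1L
    } else if (j > 0L && (i == 0L || m[i + 1L, j] >= m[i, j + 1L])) {
      out <- c(b[j], out); j <- j - 1L
    } else {
      out <- c(a[i], out); i <- i - 1L
    }
  }
  out
}

# Place every sentence on the merged column sequence, padding with NA
align_sentences <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)
  super <- Reduce(merge_seq, words)
  res <- matrix(NA_character_, length(words), length(super))
  for (r in seq_along(words)) {
    pos <- 0L
    for (w in words[[r]]) {
      # First column to the right of the previous word that holds w
      pos <- which(seq_along(super) > pos & super == w)[1]
      res[r, pos] <- w
    }
  }
  as.data.frame(res, stringsAsFactors = FALSE)
}

text <- c('I do not want to do this thing anymore', 'you do not know what I mean',
          'I will not do this thing', 'do not want anymore', 'you will see')
aligned <- align_sentences(text)

# Round trip: drop the NAs of each row and paste the words back together
restored <- unname(apply(aligned, 1, function(r) paste(r[!is.na(r)], collapse = " ")))
identical(restored, text)
```

Merging sentence by sentence is a greedy choice, so the number of columns depends on the order of the input and is not necessarily the smallest possible.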