Convert text in a column into a data frame

Asked: 2015-04-29 08:25:15

Tags: r tm

I have text data in one column of a dataset, as shown below:

UNIQUEID Cloumn1
1        FG
2        PR FG RT
3        FG BR UP DR ST
....

I want to convert this column into a data frame so that each text code (FG, RN, etc.) becomes its own variable, with output like this:

UNIQUEID   FG  PR RT BR UP DR ST
1           1  0  0  0  0  0  0
2           1  1  1  0  0  0  0
3           1  0  0  1  1  1  1
......

I tried doing the conversion with the tm package:

corpus = Corpus(VectorSource(weather$codesum))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, PlainTextDocument)
dtm =  DocumentTermMatrix(corpus)
dtm = as.data.frame(as.matrix(dtm))
colnames(dtm) = make.names(colnames(dtm))
str(dtm)

data.frame:   20517 obs. of  1 variable:
 $ prfg: num  0 0 0 0 0 0 0 0 0 0 ...

When I look at the output, I find only one variable. I want every text code to become a variable.

Please suggest a solution.

1 answer:

Answer 0 (score: 0)

If you like tidyr and dplyr, you can also try this solution:

# libraries
library(tidyr)
library(dplyr)

# your data
t <- "UNIQUEID,Cloumn1
1,FG
2,PR FG RT
3,FG BR UP DR ST"

df <- read.table(text = t, header = TRUE, sep = ",", stringsAsFactors = FALSE)

# The interesting part
df %>%
  transform(                         # splits each string into a character vector
    Cloumn1 = strsplit(Cloumn1, " ")
    ) %>%
  unnest(Cloumn1) %>%                # for each string in Cloumn1 creates a row
  mutate(v = 1) %>%                  # let's add a dummy 1
  spread(Cloumn1, v, fill= 0)        # rows become columns and NA is replaced by 0
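
If you would rather not depend on any packages, the same split-then-indicator idea can be sketched in base R. This is only a minimal sketch, assuming the codes are separated by single spaces as in the sample data; the names `tokens`, `codes`, `indicators`, and `result` are mine and are not from the question.

```r
# Sample data from the question (column name kept as "Cloumn1")
df <- data.frame(
  UNIQUEID = 1:3,
  Cloumn1 = c("FG", "PR FG RT", "FG BR UP DR ST"),
  stringsAsFactors = FALSE
)

# Split each string on single spaces into a character vector of codes
tokens <- strsplit(df$Cloumn1, " ", fixed = TRUE)

# All distinct codes, in order of first appearance
codes <- unique(unlist(tokens))

# One row per record, one 0/1 column per code
indicators <- do.call(
  rbind,
  lapply(tokens, function(x) as.integer(codes %in% x))
)
colnames(indicators) <- codes

# Attach the indicator columns to the identifier column
result <- cbind(df["UNIQUEID"], indicators)
print(result)
```

Because `codes` is built in order of first appearance, the columns come out as FG, PR, RT, BR, UP, DR, ST for the sample data, whereas `spread()` orders them alphabetically.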