I have text data in a column of my dataset, as shown below:
UNIQUEID Cloumn1
1 FG
2 PR FG RT
3 FG BR UP DR ST
....
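I want to convert this column into a data frame so that the output looks like the following, where each text code (FG, RN, etc.) becomes a variable: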
UNIQUEID FG PR RT BR UP DR ST
1 1 0 0 0 0 0 0
2 1 1 1 0 0 0 0
3 1 0 0 1 1 1 1
......
I tried to do the conversion with the tm package:
corpus = Corpus(VectorSource(weather$codesum))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, PlainTextDocument)
dtm = DocumentTermMatrix(corpus)
dtm = as.data.frame(as.matrix(dtm))
colnames(dtm) = make.names(colnames(dtm))
str(dtm)
data.frame: 20517 obs. of 1 variable:
$ prfg: num 0 0 0 0 0 0 0 0 0 0 ...
When I look at the output, I find only one variable. I want every text code to become a variable.
Please suggest a solution.
Answer 0: (score: 0)
If you like tidyr and dplyr, you can also try this solution:
# libraries
library(tidyr)
library(dplyr)
# your data
txt <- "UNIQUEID,Cloumn1
1,FG
2,PR FG RT
3,FG BR UP DR ST"
df <- read.table(text = txt, header = TRUE, sep = ",", stringsAsFactors = FALSE)
# The interesting part
df %>%
  transform(                      # transforms each string into a vector of codes
    Cloumn1 = strsplit(Cloumn1, " ")
  ) %>%
  unnest(Cloumn1) %>%             # creates one row per code in Cloumn1
  mutate(v = 1) %>%               # adds a dummy 1
  spread(Cloumn1, v, fill = 0)    # codes become columns; missing combinations are filled with 0
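
If you would rather not depend on extra packages, the same split-then-indicator idea can be sketched in base R. This is a minimal sketch, assuming the codes are separated by whitespace and that the sample data from the question is representative. The names `codes`, `vars`, `indicators`, and `result` are made up for illustration. The columns come out in order of first appearance, which matches the layout asked for in the question.

```r
# Sample data from the question (column name "Cloumn1" kept as spelled there)
df <- data.frame(
  UNIQUEID = 1:3,
  Cloumn1 = c("FG", "PR FG RT", "FG BR UP DR ST"),
  stringsAsFactors = FALSE
)

# Split each string on whitespace into a vector of codes (one vector per row)
codes <- strsplit(trimws(df$Cloumn1), "\\s+")

# Distinct codes, in order of first appearance
vars <- unique(unlist(codes))

# For each row, mark with 1 the codes that occur in it and with 0 the rest,
# then stack the per-row vectors into a matrix
indicators <- do.call(rbind, lapply(codes, function(x) as.integer(vars %in% x)))
colnames(indicators) <- vars

# Put the ID column back in front of the indicator columns
result <- cbind(df["UNIQUEID"], indicators)
print(result)
```

On real data you would replace the sample `df` with your own data frame and column (for example `weather$codesum` from the question), and it is worth checking first whether the codes really are whitespace-separated there.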