作为qdap
的新手,我不确定这个功能是否存在,但如果有下面提到的内容会很棒。
我的初始数据集。
ID Keywords
1 112 mills, open heart surgery, great, great job
2 Ausie, open, heart out
3 opened, heartily, 56mg)_job, orders12
4 order, macD
在使用all_words()
时,我最终得到以下数据。
WORD FREQ
1 great 2
2 heart 2
3 open 2
4 ausie 1
5 heartily 1
6 job 1
7 macd 1
8 mgjob 1
9 mills 1
10 opened 1
11 order 1
12 orders 1
13 out 1
14 surgery 1
是否有一种方法可以将主要数据集替换为通过all_words()
出现的确切字词?
EDIT1: 因此,使用all_words()的列表应该替换数据框中的原始单词,即112 mills应该成为mills,56mg)_job应该变成mgjob。
答案 0 :(得分:1)
它有点手动,我不知道你的数据是如何格式化的,但是做一些修补工作应该做的工作:
编辑:它没有使用qdap
,但我认为这不是问题的关键部分。
第2次修改:我忘记了替换,更正了以下代码。
library(data.table)
library(tm) # Functions with tm:: below
library(magrittr)
dt <- data.table(
ID = 1L:4L,
Keywords = c(
paste('112 mills', 'open heart', 'surgery', 'great', 'great job', sep = ' '),
paste('Ausie', 'open', 'heart out', sep = ' '),
paste('opened', 'heartily', '56mg)_job', 'orders12', sep = ' '),
paste('order', 'macD', sep = ' ')))
# dt_2 <- data.table(Tokens = tm::scan_tokenizer(dt[, Keywords]))
dt_2 <- dt[, .(Tokens = unlist(strsplit(Keywords, split = ' '))), by = ID]
dt_2[, Words := tm::scan_tokenizer(Tokens) %>%
tm::removePunctuation() %>%
tm::removeNumbers()
]
dt_2[, Stems := tm::stemDocument(Words)]
dt_2
# ID Tokens Words Stems
# 1: 1 112
# 2: 1 mills mills mill
# 3: 1 open open open
# 4: 1 heart heart heart
# 5: 1 surgery surgery surgeri
# 6: 1 great great great
# 7: 1 great great great
# 8: 1 job job job
# 9: 2 Ausie Ausie Ausi
# 10: 2 open open open
# 11: 2 heart heart heart
# 12: 2 out out out
# 13: 3 opened opened open
# 14: 3 heartily heartily heartili
# 15: 3 56mg)_job mgjob mgjob
# 16: 3 orders12 orders order
# 17: 4 order order order
# 18: 4 macD macD macD
# Frequencies
dt_2[, .N, by = Words]
# Words N
# 1: 1
# 2: mills 1
# 3: open 2
# 4: heart 2
# 5: surgery 1
# 6: great 2
# 7: job 1
# 8: Ausie 1
# 9: out 1
# 10: opened 1
# 11: heartily 1
# 12: mgjob 1
# 13: orders 1
# 14: order 1
# 15: macD 1
第二次编辑:
res <- dt_2[, .(Keywords = paste(Words, collapse = ' ')), by = ID]
res
# ID Keywords
# 1: 1 mills open heart surgery great great job
# 2: 2 Ausie open heart out
# 3: 3 opened heartily mgjob orders
# 4: 4 order macD
第3次修改,以防您的关键字作为列表出现,并且您希望保持这种方式。
library(data.table)
library(tm) # Functions with tm:: below
library(magrittr)
dt <- data.table(
ID = 1L:4L,
Keywords = list(
c('112 mills', 'open heart', 'surgery', 'great', 'great job'),
c('Ausie', 'open', 'heart out'),
c('opened', 'heartily', '56mg)_job', 'orders12'),
c('order', 'macD')))
dt_2 <- dt[, .(Keywords = unlist(Keywords)), by = ID]
dt_2[, ID_temp := .I]
dt_3 <- dt_2[, .(ID, Tokens = unlist(strsplit(unlist(Keywords), split = ' '))), by = ID_temp]
dt_3[, Words := tm::scan_tokenizer(Tokens) %>%
tm::removePunctuation() %>%
tm::removeNumbers() %>%
stringr::str_to_lower()
]
dt_3[, Stems := tm::stemDocument(Words)]
dt_3
res <- dt_3[, .(
ID = first(ID),
Keywords = paste(Words, collapse = ' ') %>% stringr::str_trim()),
by = ID_temp]
res <- res[, .(Keywords = list(Keywords)), by = ID]
# Confirm format (a list of keywords in every element)
dt[1, Keywords] %T>% {print(class(.))} %T>% {print(length(.[[1]]))}
res[1, Keywords] %T>% {print(class(.))} %T>% {print(length(.[[1]]))}