I have several unstructured sentences like the ones below. Description is the column name.
Description
Automatic lever for a machine
Vaccum chamber with additional spare
Glove box for R&D
The Mini Guage 5 sets
Vacuum chamber only
Automatic lever only
I want to split these sentences into columns Col1 to Col5 and count the occurrences, as shown below:
Col1 Col2 Col3 Col4
Automatic_lever lever_for for_a a_machine
Vaccum_chamber chamber_with with_additional additional_spare
Glove_box box_for for_R&D R&D
The_Mini Mini_Guage Guage_5 5_sets
Vacuum_chamber chamber_only only
Automatic_lever lever_only only
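I would also like to get the number of occurrences of these words across the columns above. For example, Vaccum_chamber and Automatic_lever are repeated twice here. How often do the other words occur?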
Answer 0 (score: 0)
Here is a tidyverse option
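library(tidyverse)

df %>%
    rowid_to_column("row") %>%
    mutate(words = map(str_split(as.character(Description), " "), function(x) {
        # One window per word, capped at 4 (an assumption) to match the four
        # columns Col1 to Col4 of the expected output in the question
        idx <- seq_len(min(length(x), 4))
        # A window that runs past the last word contains NA, e.g. "only_NA"
        map_chr(idx, function(i) paste0(x[i:(i + 1)], collapse = "_"))
    })) %>%
    unnest(words) %>%
    group_by(row) %>%
    mutate(
        # Strip the "_NA" left by a window that ran past the last word
        words = str_replace(words, "_NA$", ""),
        col = paste0("Col", 1:n())) %>%
    # Defensive: drop a window that contained no word at all
    filter(words != "NA") %>%
    spread(col, words, fill = "")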
## A tibble: 6 x 6
## Groups:   row [6]
#    row Description                Col1        Col2       Col3       Col4
#  <int> <fct>                      <chr>       <chr>      <chr>      <chr>
#1     1 Automatic lever for a mac… Automatic_… lever_for  for_a      a_machine
#2     2 Vaccum chamber with addit… Vaccum_cha… chamber_w… with_addi… additional…
#3     3 Glove box for R&D          Glove_box   box_for    for_R&D    R&D
#4     4 The Mini Guage 5 sets      The_Mini    Mini_Guage Guage_5    5_sets
#5     5 Vacuum chamber only        Vacuum_cha… chamber_o… only       ""
#6     6 Automatic lever only       Automatic_… lever_only only       ""
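Explanation: we split the sentences in Description on a single space " ", then join every two adjacent words with a sliding-window approach, padding where a window runs past the end of a sentence; the rest is just a long reshaping of the result into columns.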
It is not pretty, but it reproduces your expected output; instead of the manual sliding-window approach, you could also use zoo::rollapply.
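To see what the sliding window does to a single sentence, here is a minimal base R sketch. It is only an illustration: base R `vapply` stands in for `purrr::map_chr`, and the sentence is taken from the sample data. A window that starts at the last word reaches past the end of the vector, so the missing neighbour shows up as NA and has to be stripped afterwards.

```r
# Split one sentence from the sample data into words
x <- strsplit("Vacuum chamber only", " ", fixed = TRUE)[[1]]

# One two-word window per word; x[i:(i + 1)] yields NA past the end of x
windows <- vapply(
  seq_along(x),
  function(i) paste0(x[i:(i + 1)], collapse = "_"),
  character(1)
)
print(windows)

# Remove the "_NA" suffix so the last word stands alone
cleaned <- sub("_NA$", "", windows)
print(cleaned)
```

The first vector still contains "only_NA"; after the `sub()` call the last element is just "only", which is what ends up in Col3 for the three-word sentences.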
df <- read.table(text =
"Description
'Automatic lever for a machine'
'Vaccum chamber with additional spare'
'Glove box for R&D'
'The Mini Guage 5 sets'
'Vacuum chamber only'
'Automatic lever only'", header = TRUE)
Answer 1 (score: 0)
You can use the ngram [1] package to generate the output.
library(ngram)
x <- "Automatic lever for a machine"
ngram_asweka(x, min = 2, max = 2, sep = " ")
gsub(" ", "_", ngram_asweka(x, min = 2, max = 2, sep = " "))
Output: "Automatic_lever" "lever_for" "for_a" "a_machine"
You can then add the last element manually.
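Adding the last element and counting the occurrences asked about in the question could look like the sketch below. It makes a few assumptions: the bigrams are built with base R `paste()` rather than `ngram_asweka()` so that the snippet has no dependencies, `bigrams_with_last` is a hypothetical helper name, and the last word is appended to every sentence.

```r
descriptions <- c(
  "Automatic lever for a machine",
  "Vaccum chamber with additional spare",
  "Glove box for R&D",
  "The Mini Guage 5 sets",
  "Vacuum chamber only",
  "Automatic lever only"
)

# Hypothetical helper: bigrams of one sentence, followed by its last word
bigrams_with_last <- function(sentence, sep = "_") {
  w <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(w)
  # Pair every word with its successor
  pairs <- paste(head(w, -1), tail(w, -1), sep = sep)
  # Append the last word on its own
  c(pairs, tail(w, 1))
}

# Collect the terms of all sentences and count each distinct term
all_terms <- unlist(lapply(descriptions, bigrams_with_last))
counts <- table(all_terms)

# Terms that occur more than once
print(counts[counts > 1])
```

With this data only Automatic_lever and only occur twice. Vaccum_chamber and Vacuum_chamber are spelled differently, so they are counted as two separate terms unless the spelling is normalised first.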