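Is there an R function to split a sentence

Time: 2019-05-19 04:42:37

Tags: r

I have several unstructured sentences like the ones below. Description is the column name.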

Description

Automatic lever for a machine
Vaccum chamber with additional spare
Glove box for R&D
The Mini Guage 5 sets
Vacuum chamber only
Automatic lever only

I want to split each sentence into columns Col1 through Col5 and count the occurrences, as shown below

Col1             Col2          Col3             Col4
Automatic_lever  lever_for     for_a            a_machine
Vaccum_chamber   chamber_with  with_additional  additional_spare
Glove_box        box_for       for_R&D          R&D
The_Mini         Mini_Guage    Guage_5          5_sets
Vacuum_chamber   chamber_only  only
Automatic_lever  lever_only    only

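Can I also see how often these words occur in the columns above? For example, Vaccum_chamber and Automatic_lever are repeated twice here. Do any other words repeat in the same way?

2 answers:

Answer 0: (score: 0)

Here is a tidyverse option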

library(tidyverse)

df %>%
    rowid_to_column("row") %>%
    mutate(words = map(str_split(as.character(Description), " "), function(x) {
        # Slide a two-word window over the sentence; the window that starts at
        # the last word runs past the end and yields "<word>_NA"
        map_chr(seq_along(x), function(i) paste0(x[i:(i + 1)], collapse = "_"))
    })) %>%
    unnest(words) %>%
    group_by(row) %>%
    mutate(
        words = str_replace(words, "_NA$", ""),   # leave the last word on its own
        col = paste0("Col", 1:n())) %>%
    # Keep Col1 to Col4, as in the expected output of the question
    filter(col %in% paste0("Col", 1:4)) %>%
    spread(col, words, fill = "")
## A tibble: 6 x 6
## Groups:   row [6]
#    row Description                Col1        Col2       Col3       Col4
#  <int> <fct>                      <chr>       <chr>      <chr>      <chr>
#1     1 Automatic lever for a mac… Automatic_… lever_for  for_a      a_machine
#2     2 Vaccum chamber with addit… Vaccum_cha… chamber_w… with_addi… additional…
#3     3 Glove box for R&D          Glove_box   box_for    for_R&D    R&D
#4     4 The Mini Guage 5 sets      The_Mini    Mini_Guage Guage_5    5_sets
#5     5 Vacuum chamber only        Vacuum_cha… chamber_o… only       ""
#6     6 Automatic lever only       Automatic_… lever_only only       ""

Explanation: we split each sentence in Description on a single space " ", then join every pair of adjacent words with a sliding-window approach. The window that starts at the last word has no partner, so that word is kept on its own. The rest is just a long-to-wide reshape, restricted to Col1 through Col4 to match the expected output.

It is not pretty, but it reproduces your expected output. Instead of the manual sliding-window approach, you could also use zoo::rollapply.
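The sliding window does not need an explicit loop over indices. As a minimal base R sketch (the helper name bigrams is hypothetical and not part of the answer), offsetting the word vector by one position gives all adjacent pairs at once:

```r
# Hypothetical helper: all adjacent word pairs of one sentence, joined with "_".
bigrams <- function(sentence) {
    w <- strsplit(sentence, " ", fixed = TRUE)[[1]]
    # A sentence with fewer than two words has no pairs
    if (length(w) < 2) return(character(0))
    # head(w, -1) drops the last word, tail(w, -1) drops the first one,
    # so pasting them element-wise pairs each word with its successor
    paste(head(w, -1), tail(w, -1), sep = "_")
}

bigrams("Automatic lever for a machine")
# "Automatic_lever" "lever_for" "for_a" "a_machine"
```

Unlike the pipeline above, this sketch returns only the pairs and does not append the last word on its own.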


Sample data

df <- read.table(text =
    "Description
'Automatic lever for a machine'
'Vaccum chamber with additional spare'
'Glove box for R&D'
'The Mini Guage 5 sets'
'Vacuum chamber only'
'Automatic lever only'", header = T)
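The question also asks how often each pair occurs, which the pipeline above does not compute. One possible approach, sketched here in base R under the assumption that only the two-word pairs (not the trailing single words) should be counted, is to flatten all pairs into one vector and tabulate it. The variable names are illustrative.

```r
sentences <- c(
    "Automatic lever for a machine",
    "Vaccum chamber with additional spare",
    "Glove box for R&D",
    "The Mini Guage 5 sets",
    "Vacuum chamber only",
    "Automatic lever only"
)

# Adjacent word pairs of every sentence, flattened into one character vector
pairs <- unlist(lapply(strsplit(sentences, " ", fixed = TRUE), function(w) {
    paste(head(w, -1), tail(w, -1), sep = "_")
}))

# Frequency of each pair, most frequent first
counts <- sort(table(pairs), decreasing = TRUE)

# Pairs that occur more than once
counts[counts > 1]
```

With this sample data only Automatic_lever occurs twice. Vaccum_chamber and Vacuum_chamber are spelled differently in the input, so they are counted as two separate pairs unless the spelling is normalised first.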

Answer 1: (score: 0)

You can use the ngram [1] package to generate the output.

library(ngram)
x <- "Automatic lever for a machine"
ngram_asweka(x, min = 2, max = 2, sep = " ")
gsub(" ", "_", ngram_asweka(x, min = 2, max = 2, sep = " "))

Output: "Automatic_lever" "lever_for" "for_a" "a_machine"

You can then add the last element manually.

  1. https://cran.r-project.org/web/packages/ngram/ngram.pdf