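I have a data frame created with readtext(). It has two columns: doc_id and text. For each row (doc_id), I want to extract the substrings (in my case, government department names) that sit between two strings which repeat n times in the text column. For example: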
documents <- data.frame(doc_id = c("doc_1", "doc_2"),
text = c("PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah", "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 3 Department of Health \n Matters \n Blah blah blah \n PART 5 Department of Sport \n Matters \n Blah blah"))
What I want to get is:
"doc_1" "Department of Communications, Department of Forestry"
"doc_2" "Department of Communications, Department of Health, Department of Sport"
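Basically, I want to extract the strings between PART and Matters. I would like to do this on the data frame with a dplyr::rowwise operation, but I don't know how to extract between two repeating strings more than once.
Answer 0 (score: 3)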
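We can use str_match_all from stringr to extract the words between "PART" and "Matters". It returns a list of two-column matrices, one per document. We select the second column, which holds the capture group, and then use toString to collapse the values into a single comma-separated string.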
out <- stringr::str_match_all(documents$text, "PART \\d+ (.*) \n Matters")
sapply(out, function(x) toString(x[, 2]))
#[1] "Department of Communications, Department of Forestry"
#[2] "Department of Communications, Department of Health, Department of Sport"
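Since the question asks for the result as part of the data frame, the same call can be wrapped in a dplyr mutate. This is a minimal sketch, assuming the documents data frame from the question; the column name departments is a hypothetical choice, not something from the original post.

```r
library(dplyr)
library(stringr)

# Example data from the question
documents <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c(
    "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah",
    "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 3 Department of Health \n Matters \n Blah blah blah \n PART 5 Department of Sport \n Matters \n Blah blah"
  ),
  stringsAsFactors = FALSE
)

# str_match_all is vectorised over text, so no rowwise() is needed:
# it returns one matrix per row, and column 2 holds the capture group.
result <- documents %>%
  mutate(departments = sapply(
    str_match_all(text, "PART \\d+ (.*) \n Matters"),
    function(m) toString(m[, 2])
  )) %>%
  select(doc_id, departments)  # "departments" is a hypothetical column name

print(result)
```

Because str_match_all already handles every row at once, the rowwise step the question mentions is not required here.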
Answer 1 (score: 1)
I can't think of a rowwise solution right now, but maybe this helps as well:
library(dplyr)
documents %>%
mutate(text=strsplit(as.character(text), 'PART ')) %>%
tidyr::unnest(text) %>%
mutate(text=trimws(sub('\\d+ (.*) Matters.*', '\\1', text))) %>%
filter(text != '') %>%
group_by(doc_id) %>%
summarise(text=paste(text, collapse=', '))
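It basically splits all the text at PART, so that each piece can be handled separately and the relevant text cut out of the longer string. Afterwards, everything is joined back together by doc_id.
Answer 2 (score: 0)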
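To make the intermediate steps easier to follow, here is a base R sketch of the same split, extract and join sequence applied to a single document string. The variable names (one_text, pieces, departments, joined) are hypothetical and only serve the illustration. It relies on R's default regex engine, where . also matches a newline.

```r
# Text of doc_1 from the question
one_text <- "PART 1 Department of Communications \n Matters \n Blah blah blah \n PART 2 Department of Forestry \n Matters \n Blah blah blah"

# Step 1: split at "PART ". The first piece is an empty string
# because the text starts with the delimiter.
pieces <- strsplit(one_text, "PART ")[[1]]

# Step 2: keep only what lies between the part number and " Matters",
# then strip the surrounding whitespace and newline.
departments <- trimws(sub("\\d+ (.*) Matters.*", "\\1", pieces))

# Step 3: drop the empty leading piece and join the rest.
departments <- departments[departments != ""]
joined <- paste(departments, collapse = ", ")

print(joined)
```

The dplyr pipeline above does the same thing for all documents at once, with unnest giving each piece its own row and group_by plus summarise performing the final join.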
#Import Tidyverse
library(tidyverse)
#Use a helper variable to store the results of the extracted departments based on the pattern
Helper <- str_extract_all(string = documents$text, pattern = "Department.*\\n")
#Clean Up the columns.
Helper1 <- lapply(Helper, FUN = str_replace_all, pattern=" \\n", replacement = ", ")
documents$Departments<-str_replace(str_trim(unlist(lapply(Helper1, FUN =paste, collapse= ""))), pattern = ",$", replacement = "")
#Remove the original text column
documents <- select(documents, -c("text"))