I am trying to extract text from a semi-structured text document based on its headers.
Input:
Column<-"Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
The desired output here is:
Order    Subject  Name           Grade  Report            Conclusion
1223442  History  Bilbo Johnson  Bad    Need to complete  Dud
I can achieve this with the following (messy but working) function:
library(magrittr)  # provides %>%

dataframeIn <- data.frame(Column, stringsAsFactors = FALSE)
delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

Extractor <- function(dataframeIn, Column, delim) {
  # Keep the original data frame and column name for later
  dataframeInForLater <- dataframeIn
  ColumnForLater <- Column
  Column <- rlang::sym(Column)
  dataframeIn <- data.frame(dataframeIn)
  # Split the text at every header name, assigning the pieces positionally
  dataframeIn <- dataframeIn %>%
    tidyr::separate(!!Column, into = c("added_name", delim),
                    sep = paste(delim, collapse = "|"),
                    extra = "drop", fill = "right")
  names(dataframeIn) <- gsub(".", "", names(dataframeIn), fixed = TRUE)
  dataframeIn <- data.frame(dataframeIn)
  # Add the original column back in so we keep the original reference
  dataframeIn <- cbind(dataframeInForLater[, ColumnForLater], dataframeIn)
  dataframeIn <- data.frame(dataframeIn)
  return(dataframeIn)
}

Extractor(dataframeIn, "Column", delim)
However, sometimes a delimiter is missing, for example:
Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud
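In this case, the desired output is:
Order    Subject  Name           Grade  Conclusion
1223442  History  Bilbo Johnson  Bad    Dud
But the actual output becomes:
Order     Subject   Name           Grade  Report  Conclusion
:1223442  :History  Bilbo Johnson  : Bad  : Dud   <NA>
How can I account for missing delimiters, given that the ones present always appear in the same order? This includes delimiters missing from the middle of the text, as in the example above, as well as from the end.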
Answer 0 (score: 0)
We can do the following (this is only the text extraction; I will leave constructing the output to you):
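library(stringr)

# The two example strings and the header names from the question
Column1 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud"
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

Extractor <- function(x, delim) {
  # One pattern per header: the header, an optional colon, a lazy capture,
  # then the next header (any of them) or the end of the string
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  # Column 2 of the match matrix holds the first capture group
  trimws(str_match(x, pattern)[, 2])
}

Extractor(Column1, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" "Need to complete" "Dud"
Extractor(Column2, delim)
# [1] "1223442" "History" "Bilbo Johnson" "Bad" NA "Dud"
Column3 <- "Subject:History Name Bilbo Johnson"
Extractor(Column3, delim)
# [1] NA "History" "Bilbo Johnson" NA NA NA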
Thanks to the NA values, it is clear which delimiters are missing and which are not.
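To make that concrete, the NA positions can be mapped back to header names. The following is a minimal sketch that repeats the function above so it runs on its own; the variable names `values`, `missing_headers` and `found` are only illustrative.

```r
library(stringr)

delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

# Same extraction function as above
Extractor <- function(x, delim) {
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  trimws(str_match(x, pattern)[, 2])
}

# Example string from the question with "Report" missing
Column2 <- "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"

values <- Extractor(Column2, delim)

# Headers whose pattern did not match anywhere in the text
missing_headers <- delim[is.na(values)]
missing_headers
# [1] "Report"

# Named vector containing only the headers that were found
found <- setNames(values[!is.na(values)], delim[!is.na(values)])
```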
In your case, it works because we have a series of patterns, one per header:
pattern
# [1] "Order:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [2] "Subject:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [3] "Name:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [4] "Grade:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [5] "Report:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
# [6] "Conclusion:{0,1}(.*?)(Order|Subject|Name|Grade|Report|Conclusion|$)"
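str_match then conveniently extracts the (.*?) part into the second column of its output, and we use trimws to strip any surrounding whitespace. Note that (.*?) is a lazy match, so it stops at the first following header instead of matching too much.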
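For the output construction that the answer leaves open, one possible approach is to run the extraction on each string and bind the results into a data frame with one column per header. This is only a sketch: `extract_to_df` is a hypothetical helper name, not part of stringr, and the final step that drops all-NA columns is an assumption about what the desired output should look like.

```r
library(stringr)

delim <- c("Order", "Subject", "Name", "Grade", "Report", "Conclusion")

# Same extraction function as in the answer
Extractor <- function(x, delim) {
  pattern <- paste0(delim, ":{0,1}(.*?)(", paste(c(delim, "$"), collapse = "|"), ")")
  trimws(str_match(x, pattern)[, 2])
}

# Hypothetical helper: one row per input string, one column per header
extract_to_df <- function(texts, delim) {
  rows <- lapply(texts, Extractor, delim = delim)
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- delim
  out
}

texts <- c(
  "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Report: Need to complete Conclusion: Dud",
  "Order:1223442 Subject:History Name Bilbo Johnson Grade: Bad Conclusion: Dud"
)

result <- extract_to_df(texts, delim)

# Optionally keep only the columns that have a value in at least one row
result_nonempty <- result[, colSums(!is.na(result)) > 0, drop = FALSE]
```

Because every row has the same six columns, strings with missing headers line up with complete ones, and the missing values stay NA instead of shifting later values to the left.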