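I have a data frame consisting of a single column. I want to split the text into separate columns according to a vector of delimiters.
Input: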
Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<- "PathReportWhole"
Expected output:
structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"),
HospitalNumber = " 233456 ", PatientName = " Jonny Begood",
DOB = " 13/01/77 ", GeneralPractitioner = NA_character_,
Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ",
Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ",
Histology = " These show chronic reflux and other bits n bobs\n ",
Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName",
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails",
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")
I was keen to use the separate function from tidyr, but I can't work out whether it can split on a list of delimiters.
The list would be:
mywords<-c("Hospital Number","Patient Name","DOB:","General Practitioner:","Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:")
Then I tried:
Mypath %>% separate(Mypath, mywords)
But I have clearly misunderstood the function; it seems it cannot take a list of delimiters:
Error: `var` must evaluate to a single number or a column name, not a list
Is there a simple way to achieve this using tidyr (or csplit with a list, or any other approach)?
Answer 0 (score: 2)
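Perhaps massage the text so that it looks like a DCF file; then you can use read.dcf:
Note that this "mywords" differs slightly from your "mywords": I have added a colon to "Hospital Number" and "Patient Name".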
mywords<-c("Hospital Number:","Patient Name:","DOB:","General Practitioner:",
"Date of Procedure:","Clinical Details:","Macroscopic description:",
"Histology:","Diagnosis:")
Convert the relevant column to character and add a colon after "Hospital Number".
Mypath$PathReportWhole <- as.character(Mypath$PathReportWhole)
Mypath$PathReportWhole <- gsub("Hospital Number", "Hospital Number:", Mypath$PathReportWhole)
Put each key: value pair on its own line.
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", Mypath$PathReportWhole)
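To make the substitution step easier to follow, here is a minimal sketch on a shortened version of the question's text. The variable names txt and pattern are introduced only for this illustration, and the sketch assumes the keywords contain no regular-expression metacharacters.

```r
# Delimiters, each ending in a colon (a subset of the answer's list)
mywords <- c("Hospital Number:", "Patient Name:", "DOB:")

# Shortened version of the report text, with the colon already added
# after "Hospital Number" (illustration only)
txt <- "Hospital Number: 233456 Patient Name: Jonny Begood DOB: 13/01/77"

# Join the keywords into one alternation wrapped in a capture group
pattern <- sprintf("(%s)", paste(mywords, collapse = "|"))
print(pattern)

# Put a newline in front of every keyword; \\1 re-inserts the matched keyword
temp <- gsub(pattern, "\n\\1", txt)
writeLines(temp)
```

Because the first keyword sits at the very start of the string, temp begins with an empty line, which read.dcf skips.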
Read it with read.dcf:
out <- read.dcf(textConnection(temp))
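read.dcf returns a character matrix, whereas the question asked for new columns next to the original one. The following is a rough sketch of one way to attach the parsed fields, using a shortened version of the input. The names parsed and result, and the choice to strip non-letters from the column names, are assumptions made for this illustration rather than part of the answer.

```r
# Shortened version of the question's input (illustration only)
Mypath <- data.frame(
  PathReportWhole = "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 Diagnosis: Acid reflux likely",
  stringsAsFactors = FALSE
)
mywords <- c("Hospital Number:", "Patient Name:", "DOB:", "Diagnosis:")

# Same steps as in the answer: add the missing colon, then one pair per line
temp <- gsub("Hospital Number", "Hospital Number:", Mypath$PathReportWhole)
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", temp)

# Parse, closing the connection explicitly to avoid a warning
con <- textConnection(temp)
out <- read.dcf(con)
close(con)

# Turn the matrix into a data frame and drop spaces from the field names
parsed <- as.data.frame(out, stringsAsFactors = FALSE)
names(parsed) <- gsub("[^A-Za-z]", "", colnames(out))

# Bind the parsed fields next to the original column
result <- cbind(Mypath, parsed)
str(result)
```

This relies on every row producing exactly one record, so that the rows of out line up with the rows of Mypath.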
Here is some sample data that makes the resulting structure easier to see:
example <- c("var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here",
"var 1 xyz var 2: more text here var 5: not all values are there")
example <- data.frame(report = example)
example
# report
# 1 var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here
# 2 var 1 xyz var 2: more text here var 5: not all values are there
And, following the same steps:
mywords <- c("var 1:", "var 2:", "var 3:", "var 4:", "var 5:")
temp <- as.character(example$report)
temp <- gsub("var 1", "var 1:", temp)
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", temp)
read.dcf(textConnection(temp))
#      var 1 var 2            var 3 var 4   var 5
# [1,] "abc" "some, text"     "112" "value" "even more here"
# [2,] "xyz" "more text here" NA    NA      "not all values are there"
read.dcf(textConnection(temp), fields = c("var 1", "var 3", "var 5"))
#      var 1 var 3 var 5
# [1,] "abc" "112" "even more here"
# [2,] "xyz" NA    "not all values are there"
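The steps above could be wrapped in a small helper. This is only a sketch: split_by_keywords is a hypothetical name, not an existing function. It assumes that every keyword already ends in a colon in the text, that the keywords contain no regular-expression metacharacters, and that each element contains at least one keyword (otherwise the rows would no longer line up with the input).

```r
# Hypothetical helper: split each string in x on a vector of "key:" delimiters
split_by_keywords <- function(x, keywords) {
  x <- as.character(x)
  # One alternation pattern; put each keyword at the start of its own line
  pattern <- sprintf("(%s)", paste(keywords, collapse = "|"))
  temp <- gsub(pattern, "\n\\1", x)
  con <- textConnection(temp)
  on.exit(close(con))
  # Request every field (keyword minus its colon) so that fields missing
  # from all records still appear, filled with NA
  fields <- sub(":$", "", keywords)
  as.data.frame(read.dcf(con, fields = fields), stringsAsFactors = FALSE)
}

# Small made-up input in the style of the example above
example <- c("var 1: abc var 2: some, text var 3: 112",
             "var 1: xyz var 3: 7")
res <- split_by_keywords(example, c("var 1:", "var 2:", "var 3:"))
print(res)
```

Passing fields explicitly fixes the column order and keeps the set of columns stable even when a keyword never occurs in the data.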