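How to split one column into multiple columns based on a character vector of delimiters

Time: 2017-12-18 15:33:10

Tags: r

I have a data frame made up of a single column. I want to split the text into separate columns based on a vector of delimiters.

Input: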

Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood  DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<- "PathReportWhole"

Expected output:

structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n    DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"), 
    HospitalNumber = " 233456 ", PatientName = " Jonny Begood", 
    DOB = " 13/01/77 ", GeneralPractitioner = NA_character_, 
    Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ", 
    Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ", 
    Histology = " These show chronic reflux and other bits n bobs\n ", 
    Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName", 
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails", 
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")

I am keen to use the separate function from tidyr, but I cannot work out whether it can split on a list of delimiters.

The list would be:

mywords<-c("Hospital Number","Patient Name","DOB:","General Practitioner:","Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:")

Then I tried:

Mypath %>% separate(Mypath, mywords)

But I have clearly misunderstood the function; it seems it cannot take a list of delimiters:

Error: `var` must evaluate to a single number or a column name, not a list

Is there a simple way to do this with tidyr (or csplit with a list, or any other way)?

1 Answer:

Answer 0: (score: 2)

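Perhaps make the text look like a DCF file, so that you can use read.dcf.

Note that "mywords" here is slightly different from your "mywords": I have added colons to "Hospital Number" and "Patient Name".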

mywords<-c("Hospital Number:","Patient Name:","DOB:","General Practitioner:",
           "Date of Procedure:","Clinical Details:","Macroscopic description:",
           "Histology:","Diagnosis:")

Convert the relevant column to character, and add a colon after "Hospital Number".

Mypath$PathReportWhole <- as.character(Mypath$PathReportWhole)
Mypath$PathReportWhole <- gsub("Hospital Number", "Hospital Number:", Mypath$PathReportWhole)

Put each key: value pair on its own line.

temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", Mypath$PathReportWhole)
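To see what this step does, it can help to inspect the pattern and the result on a shortened string. This is only a small sketch: `keys`, `report`, and `pattern` are illustrative names and values assumed for the example, not part of the code above.

```r
# Illustrative keys and a shortened report string (assumed for this sketch)
keys <- c("Hospital Number:", "DOB:", "Diagnosis:")
report <- "Hospital Number: 233456 DOB: 13/01/77 Diagnosis: Acid reflux likely"

# paste() joins the keys with "|" (regex alternation); sprintf() wraps them
# in one capture group so that "\\1" refers to whichever key matched
pattern <- sprintf("(%s)", paste(keys, collapse = "|"))
print(pattern)

# Insert a newline in front of every key found in the text
temp <- gsub(pattern, "\n\\1", report)
cat(temp, "\n")
```

Each key now starts its own line, which is the layout read.dcf expects. The keys are used as a regular expression here, so a key containing regex metacharacters would need escaping first.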

Read it with read.dcf:

out <- read.dcf(textConnection(temp))
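read.dcf returns a character matrix rather than a data frame. One possible way to get closer to the layout in the question is to convert the matrix and bind it to the original column. The sketch below assumes a shortened input and only three keys so that it stays self-contained; stripping the spaces from the column names is likewise an assumption about the desired naming.

```r
# Shortened version of the input (assumed for this sketch)
Mypath <- data.frame(
  PathReportWhole = "Hospital Number: 233456 DOB: 13/01/77 Diagnosis: Acid reflux likely",
  stringsAsFactors = FALSE
)
mywords <- c("Hospital Number:", "DOB:", "Diagnosis:")

# Same steps as above: one key per line, then parse as DCF
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1",
             Mypath$PathReportWhole)
con <- textConnection(temp)
out <- read.dcf(con)
close(con)

# Drop the spaces from the field names, e.g. "Hospital Number" -> "HospitalNumber"
colnames(out) <- gsub(" ", "", colnames(out))

# Convert the character matrix and attach it to the original data frame
result <- cbind(Mypath, as.data.frame(out, stringsAsFactors = FALSE))
str(result)
```

The values come back without the surrounding spaces shown in the expected output, because read.dcf trims leading and trailing whitespace by default.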

Here is some sample data that makes the resulting structure easier to see:

example <- c("var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here",
            "var 1 xyz var 2: more text here var 5: not all values are there")
example <- data.frame(report = example)
example
#                                                                      report
# 1 var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here
# 2           var 1 xyz var 2: more text here var 5: not all values are there

And, following the same steps:

mywords <- c("var 1:", "var 2:", "var 3:", "var 4:", "var 5:")
temp <- as.character(example$report)
temp <- gsub("var 1", "var 1:", temp)
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", temp)
read.dcf(textConnection(temp))
#      var 1 var 2            var 3 var 4   var 5                     
# [1,] "abc" "some, text"     "112" "value" "even more here"          
# [2,] "xyz" "more text here" NA    NA      "not all values are there"

read.dcf(textConnection(temp), fields = c("var 1", "var 3", "var 5"))
#      var 1 var 3 var 5                     
# [1,] "abc" "112" "even more here"          
# [2,] "xyz" NA    "not all values are there"
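If the same steps are needed for several columns, they could be wrapped in a small function. The following is a sketch under stated assumptions, not a definitive implementation: `split_by_keys` is a hypothetical helper name, every key is assumed to already end in a colon in the text, and each string is assumed to start with one of the keys.

```r
# Hypothetical helper: split each string in `x` into one column per key.
# `keys` must include the trailing colon and appear literally in the text.
split_by_keys <- function(x, keys) {
  pattern <- sprintf("(%s)", paste(keys, collapse = "|"))
  # Put every key on its own line; each string then forms one DCF record
  temp <- gsub(pattern, "\n\\1", as.character(x))
  con <- textConnection(temp)
  on.exit(close(con))
  # Request the fields explicitly so the columns follow the order of `keys`
  out <- read.dcf(con, fields = sub(":$", "", keys))
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Sample data assumed for this sketch; the second record has no "var 2"
example <- c("var 1: abc var 2: some, text var 3: 112",
             "var 1: xyz var 3: 7")
res <- split_by_keys(example, c("var 1:", "var 2:", "var 3:"))
print(res)
```

Keys that are missing from a record come back as NA, as in the matrix output above.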