I have been using tidyr to split some text into columns.
Mypathcolon <- data.frame(c("1 Hospital: Random NHS Foundation Trust\nHospital Number: H2890235\nPatient Name: al-Bilal, Widdad\nDOB: 1922-05-04\nGeneral Practitioner: Dr. Mondragon, Amber\nDate received: 2002-11-10\nClinical Details: Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index\n3 specimen. Nature of specimen: Nature of specimen as stated on pot = 'Ascending colon x2 '|,Nature of specimen as stated on request form = 'rectum'|,Nature of specimen as stated on pot = '4X LOWER, 4X UPPER OESOPHAGUS '|,Nature of specimen as stated on pot = 'rectal polyp '|\nMacroscopic description: 1 specimens collected the largest measuring 3 x 5 x 2 mm and the smallest 3 x 5 x 5 mm\nHistology: The appearances are of a hyperplastic polyp.,8 pieces of tissue, the largest measuring 4."))
names(Mypathcolon)<-c("PathReportWhole")
Histoltree <- c("Hospital Number:","Patient Name:",
"DOB:","General Practitioner:","Date received:",
"Clinical Details","Nature of specimen",
"Macroscopic description:","Histology","Diagnosis")
Mypathcolon %>%
tidyr::separate(PathReportWhole,
into = c("added_name",Histoltree),
sep = paste(Histoltree, collapse = "|"))
This gives me the column names
[1] "added_name" "Hospital Number:" "Patient Name:" "DOB:"
[5] "General Practitioner:" "Date received:" "Clinical Details" "Nature of specimen"
[9] "Macroscopic description:" "Histology" "Diagnosis"
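However, the columns from "Nature of specimen" to "Diagnosis" actually contain the text between successive occurrences of "Nature of specimen" in the report, rather than the text from "Nature of specimen" up to "Macroscopic description" as they should. See the actual output below: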
structure(list(added_name = "1 Hospital: Random NHS Foundation Trust\n",
`Hospital Number:` = " H2890235\n", `Patient Name:` = " al-Bilal, Widdad\n",
`DOB:` = " 1922-05-04\n", `General Practitioner:` = " Dr. Mondragon, Amber\n",
`Date received:` = " 2002-11-10\n", `Clinical Details` = ": Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index\n3 specimen. ",
`Nature of specimen` = ": ", `Macroscopic description:` = " as stated on pot = 'Ascending colon x2 '|,",
Histology = " as stated on request form = 'rectum'|,", Diagnosis = " as stated on pot = '4X LOWER, 4X UPPER OESOPHAGUS '|,"), .Names = c("added_name",
"Hospital Number:", "Patient Name:", "DOB:", "General Practitioner:",
"Date received:", "Clinical Details", "Nature of specimen", "Macroscopic description:",
"Histology", "Diagnosis"), row.names = 1L, class = "data.frame")
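How can I force the function to extract the columns between the listed delimiters, rather than splitting again at every repeated occurrence of a delimiter? The intended output is: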
Hospital: Random NHS Foundation Trust\n
Hospital Number: H2890235\n
Patient Name: al-Bilal, Widdad\n
DOB: 1922-05-04\n
General Practitioner: Dr. Mondragon, Amber\n
Date received: 2002-11-10\n
Clinical Details: Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index\n3 specimen.
Nature of specimen: Nature of specimen as stated on pot = 'Ascending colon x2 '|,Nature of specimen as stated on request form = 'rectum'|,Nature of specimen as stated on pot = '4X LOWER, 4X UPPER OESOPHAGUS '|,Nature of specimen as stated on pot = 'rectal polyp '|\n
Macroscopic description: 1 specimens collected the largest measuring 3 x 5 x 2 mm and the smallest 3 x 5 x 5 mm\n
Histology: The appearances are of a hyperplastic polyp.,8 pieces of tissue, the largest measuring 4.
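Answer 0 (score: 1)
After your edit, I can see what you want. The key here is to split the string on two patterns and then create a data frame. cSplit() from the splitstackshape package can then split the strings on a delimiter (here, :).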
library(dplyr)
library(tidyr)
library(stringi)
library(splitstackshape)
# Convert factor to character
Mypathcolon$PathReportWhole <- as.character(Mypathcolon$PathReportWhole)
# Split the string at two specific points, create a data frame,
# assign a column name, split strings
temp <- stri_split_regex(str = Mypathcolon$PathReportWhole, pattern = "\\n(?=[A-Z])|\\.\\s(?=.*:)") %>%
as.data.frame %>%
setNames("foo") %>%
cSplit("foo", sep = ":", direction = "wide", type.convert = FALSE)
Printing temp gives the following (the two columns are wrapped onto separate blocks):
foo_1
1: 1 Hospital
2: Hospital Number
3: Patient Name
4: DOB
5: General Practitioner
6: Date received
7: Clinical Details
8: Nature of specimen
9: Macroscopic description
10: Histology
foo_2
1: Random NHS Foundation Trust
2: H2890235
3: al-Bilal, Widdad
4: 1922-05-04
5: Dr. Mondragon, Amber
6: 2002-11-10
7: Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index\n3 specimen
8: Nature of specimen as stated on pot = 'Ascending colon x2 '|,Nature of specimen as stated on request form = 'rectum'|,Nature of specimen as stated on pot = '4X LOWER, 4X UPPER OESOPHAGUS '|,Nature of specimen as stated on pot = 'rectal polyp '|
9: 1 specimens collected the largest measuring 3 x 5 x 2 mm and the smallest 3 x 5 x 5 mm
10: The appearances are of a hyperplastic polyp.,8 pieces of tissue, the largest measuring 4.
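If a single row with one column per heading is needed, as in the question, the key/value pairs can be reshaped afterwards. The following is only a minimal base-R sketch of that idea, not part of the original answer: it uses a shortened, hypothetical report string in place of Mypathcolon$PathReportWhole, reuses the same two split patterns, and the names report, parts, keys, values and wide are made up for illustration.

```r
# Hypothetical, shortened report text standing in for Mypathcolon$PathReportWhole
report <- "1 Hospital: Random NHS Foundation Trust\nHospital Number: H2890235\nClinical Details: Previous had serrated lesions\n3 specimen. Nature of specimen: Nature of specimen as stated on pot = 'rectum'|\nHistology: The appearances are of a hyperplastic polyp."

# Split on the same two patterns as above: a newline followed by a capital
# letter, or ". " when a colon appears later on the same line
parts <- strsplit(report, "\\n(?=[A-Z])|\\.\\s(?=.*:)", perl = TRUE)[[1]]

# Split each piece at its FIRST colon only, so that repeated phrases such as
# "Nature of specimen" inside a value do not create extra fields
pos    <- regexpr(":", parts, fixed = TRUE)
keys   <- trimws(substr(parts, 1, pos - 1))
values <- trimws(substr(parts, pos + 1, nchar(parts)))

# One row, one column per heading
wide <- as.data.frame(t(values), stringsAsFactors = FALSE)
names(wide) <- keys
```

Splitting at the first colon only is what keeps the repeated "Nature of specimen" text together in one field. A real report may contain lines without a colon, which this sketch does not handle.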