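I have a folder containing many .txt files. I want to read all of them, extract from each file the text located between two markers, and store the results in a .csv file.
The text to extract always sits between these two markers: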
IMPRESSION: "text to be extracted" (Dr. Deepak Bhatt)
OR
IMPRESSION : "text to be extracted" (Dr. Deepak Bhatt)
The code I wrote below does not extract the text from all of the files. How can I fix this?
names <- list.files(path = "C:\\Users\\Admin\\Downloads\\data\\data",
pattern = "*.txt", all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
all.names <- lapply(names,readFn)
readFn <- function(i)
{
file <- read_file(i)
file <- gsub("[\r\n\t]", " ", file)
extracted_txt <- rm_between(file,
'IMPRESSION :', '(Dr. Deepak Bhatt)',
extract=TRUE, trim = TRUE, clean = TRUE)
if(is.na(extracted_txt))
{
extracted_txt <- rm_between(file,
'IMPRESSION:', '(Dr. Deepak Bhatt)',
extract=TRUE, trim = TRUE, clean = TRUE)
}
}
output <- do.call(rbind,all.names)
name_of_file <- sub(".txt","",names)
final_output <- cbind(name_of_file,output)
colnames(final_output) <- c('filename','text')
write.csv(final_output,"final_output.csv",row.names=F)
Example 1: filename = 15-1-2011.txt
The optic nerve is normal.
There is diffuse enlargement of the lacrimal gland (more marked on the left side).
IMPRESSION:
Bilateral diffuse irregular enlargement of the lacrimal gland is due to inflammatory enlargement (? Sjogerns syndrome).
The left gland is more enlarged than right.
No mass lesion or cystic lesion noted.
No evidence of retinal detachment.
(Dr. Deepak Bhatt)
(B-Scan findings are interpretation of echoes and need to be correlated clinically)
Example 2: filename = 1-12-48.txt
The ciliary body and ciliary process are normal in position and texture.
There is marked steching of the zonules.
IMPRESSION :
Left sided marked stretching of the zonules noted from 2 to 6 O’clock position.
There is absence of zonules at 3 O’clock position.
The angle is normal and the ciliary body, processes are normal in position.
(Dr. Deepak Bhatt)
(UBM findings are interpretation of echoes and need to be correlated clinically)
#### Goal
OUTPUT file: final_output.csv
15-1-2011 Bilateral diffuse.....retinal detachment.
1-12-48 Left sided marked stretching of the zonules ...in position.
Answer 0 (score: 1)
You can use gsub:
text_between_words <- "IMPRESSION: text to be extracted (Dr. Deepak Bhatt)"
gsub('IMPRESSION:\\s+(.*)\\s+\\(.*\\)', '\\1', text_between_words)
Result:
[1] "text to be extracted"
Or, with a simpler pattern, combine it with trimws to strip the surrounding whitespace:
trimws(gsub('IMPRESSION:(.*)\\(.*\\)', '\\1', text_between_words))
Result:
[1] "text to be extracted"
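One caveat: gsub returns its input unchanged when the pattern does not match, so a file without the markers would yield its whole content rather than a missing value. Below is a minimal sketch of a stricter variant built on base R's regexec and regmatches. The helper name extract_between_markers is hypothetical and not part of the original answer.

```r
# Hypothetical helper: returns NA when the markers are not found,
# instead of echoing the input back as gsub would.
extract_between_markers <- function(x) {
  m <- regmatches(x, regexec("IMPRESSION:(.*)\\(.*\\)", x))[[1]]
  # m[1] is the whole match, m[2] the captured group; length 0 means no match
  if (length(m) < 2) NA_character_ else trimws(m[2])
}

found <- extract_between_markers("IMPRESSION: text to be extracted (Dr. Deepak Bhatt)")
missing <- extract_between_markers("a report without the marker")
print(found)
print(missing)
```

Returning NA makes unmatched files easy to spot in the final table.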
When there is a space between IMPRESSION and the colon, you can adjust the code to:
text_between_words2 <- "IMPRESSION : text to be extracted (Dr. Deepak Bhatt)"
trimws(gsub('IMPRESSION\\s{0,1}:(.*)\\(.*\\)', '\\1', text_between_words2))
As you can see, I added \\s{0,1} between IMPRESSION and the colon. This matches zero or one whitespace character between IMPRESSION and the colon. The result is:
[1] "text to be extracted"
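Because gsub is vectorised, one pattern can be checked against both marker variants at once. The sketch below assumes that \\s? is an acceptable shorthand for \\s{0,1}; the variable names are illustrative only.

```r
# Both marker variants from the question
inputs <- c(
  "IMPRESSION: text to be extracted (Dr. Deepak Bhatt)",
  "IMPRESSION : text to be extracted (Dr. Deepak Bhatt)"
)

# \\s? means "zero or one whitespace character", the same as \\s{0,1}
results <- trimws(gsub("IMPRESSION\\s?:(.*)\\(.*\\)", "\\1", inputs))
print(results)
```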
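For the modification requested in the comments below, where text before and after the markers must be dropped as well, the pattern needs a further adjustment: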
text_between_words3 <- "Some Text before..... IMPRESSION: text to be extracted (Dr. Deepak Bhatt) text that should go too"
trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*', '\\1', text_between_words3))
Result:
[1] "text to be extracted"
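The generic closing pattern \\(.*\\) matches any parenthesised text. The sample reports in the question contain several parenthesised passages after IMPRESSION, so the greedy capture there would probably run up to the last one. A small sketch with a shortened, made-up report in the same shape illustrates this:

```r
# Shortened report modelled on Example 1 (illustrative text only)
report <- paste(
  "Some findings. IMPRESSION: Enlargement of the gland (? Sjogerns syndrome).",
  "No mass lesion noted. (Dr. Deepak Bhatt)",
  "(B-Scan findings need to be correlated clinically)"
)

generic <- trimws(gsub(".*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*", "\\1", report))

# The greedy (.*) stops at the last "(", so the signature is still in the result
has_signature <- grepl("(Dr. Deepak Bhatt)", generic, fixed = TRUE)
print(has_signature)
```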
If the closing marker is always that one specific name (Dr. Deepak Bhatt), you can also match it literally:
trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(Dr. Deepak Bhatt\\).*', '\\1', text_between_words3))
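To tie this back to the folder of files, here is a rough sketch of the whole workflow in base R. It swaps the question's read_file and rm_between calls for readLines and regexec, and the helper names extract_impression and extract_folder are hypothetical. Two likely reasons the question's code misses files: readFn ends with an if statement, which returns NULL whenever the first pattern already matched, and full.names = FALSE only works when the working directory is the data folder.

```r
# Hypothetical helper: text between "IMPRESSION:" / "IMPRESSION :" and
# "(Dr. Deepak Bhatt)" in one file, or NA if the markers are missing.
extract_impression <- function(path) {
  # Read the whole file and flatten line breaks and tabs into spaces
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  txt <- gsub("[\r\n\t]+", " ", txt)
  pattern <- "IMPRESSION\\s?:(.*)\\(Dr\\. Deepak Bhatt\\)"
  m <- regmatches(txt, regexec(pattern, txt))[[1]]
  if (length(m) < 2) NA_character_ else trimws(m[2])
}

# Hypothetical wrapper: process every .txt file in `folder` and write a CSV
extract_folder <- function(folder, out_csv) {
  # full.names = TRUE so files are found regardless of the working directory
  files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
  out <- data.frame(
    filename = sub("\\.txt$", "", basename(files)),
    text = vapply(files, extract_impression, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
  write.csv(out, out_csv, row.names = FALSE)
  out
}

# Demo on a temporary folder holding a shortened version of Example 2
demo_dir <- file.path(tempdir(), "impression_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(
  "There is marked stretching of the zonules.",
  "IMPRESSION :",
  "There is absence of zonules at 3 O'clock position.",
  "(Dr. Deepak Bhatt)",
  "(UBM findings are interpretation of echoes)"
), file.path(demo_dir, "1-12-48.txt"))

result <- extract_folder(demo_dir, file.path(demo_dir, "final_output.csv"))
print(result)
```

For the real data, the call would be extract_folder("C:\\Users\\Admin\\Downloads\\data\\data", "final_output.csv"), using the path from the question.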