Extract the text between two words from all files in a folder in R

Time: 2017-12-01 09:39:04

Tags: r regex

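I have a folder containing many .txt files. I want to read all of the files, extract from each one the text located between two words, and store the results in a .csv file.

The text to be extracted always sits between the same two markers: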

IMPRESSION:  "text to be extracted"  (Dr. Deepak Bhatt)

OR

IMPRESSION : "text to be extracted"  (Dr. Deepak Bhatt)

The code I wrote below does not extract the text from all of the files. How can I fix this?

    names <- list.files(path = "C:\\Users\\Admin\\Downloads\\data\\data",
                        pattern = "*.txt", all.files = FALSE,
                        full.names = FALSE, recursive = FALSE,
                        ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

    all.names <- lapply(names, readFn)

    readFn <- function(i)
    {
        file <- read_file(i)

        file <- gsub("[\r\n\t]", " ", file)

        extracted_txt <- rm_between(file,
            'IMPRESSION :', '(Dr. Deepak Bhatt)',
            extract = TRUE, trim = TRUE, clean = TRUE)

        if (is.na(extracted_txt))
        {
            extracted_txt <- rm_between(file,
                'IMPRESSION:', '(Dr. Deepak Bhatt)',
                extract = TRUE, trim = TRUE, clean = TRUE)
        }
    }

    output <- do.call(rbind, all.names)
    name_of_file <- sub(".txt", "", names)
    final_output <- cbind(name_of_file, output)
    colnames(final_output) <- c('filename', 'text')
    write.csv(final_output, "final_output.csv", row.names = F)

Example 1: filename = 15-1-2011.txt

The optic nerve is normal.


There is diffuse enlargement of the lacrimal gland (more marked on the left side).

IMPRESSION:

Bilateral diffuse irregular enlargement of the lacrimal gland is due to inflammatory enlargement (? Sjogerns syndrome).
The left gland is more enlarged than right.
No mass lesion or cystic lesion noted.
No evidence of retinal detachment.


(Dr. Deepak Bhatt)

(B-Scan findings are interpretation of echoes and need to be correlated clinically)

Example 2: filename = 1-12-48.txt

The ciliary body and ciliary process are normal in position and texture.

There is marked steching of the zonules.


IMPRESSION :

Left sided marked stretching of the zonules noted from 2 to 6 O’clock position.
There is absence of zonules at 3 O’clock position.
The angle is normal and the ciliary body, processes are normal in position.


(Dr. Deepak Bhatt)

(UBM findings are interpretation of echoes and need to be correlated clinically) 
#### Goal
OUTPUT file: final_output.csv

15-1-2011      Bilateral diffuse.....retinal detachment.

1-12-48        Left sided marked stretching of the zonules ...in  position.

1 Answer:

Answer 0: (score: 1)

You can use gsub:

    text_between_words <- "IMPRESSION:  text to be extracted  (Dr. Deepak Bhatt)"
    gsub('IMPRESSION:\\s+(.*)\\s+\\(.*\\)', '\\1', text_between_words)

Result:

    [1] "text to be extracted "

Or combine it with trimws to drop the surrounding whitespace:

    trimws(gsub('IMPRESSION:(.*)\\(.*\\)', '\\1', text_between_words))

Result:

    [1] "text to be extracted"

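When there is a space between IMPRESSION and the colon, you can adjust the code to:

    text_between_words2 <- "IMPRESSION :  text to be extracted  (Dr. Deepak Bhatt)"
    trimws(gsub('IMPRESSION\\s{0,1}:(.*)\\(.*\\)', '\\1', text_between_words2))

As you can see, I added \\s{0,1} between IMPRESSION and the colon. It matches zero or one whitespace character at that position, so both spellings are accepted. The result is:

    [1] "text to be extracted"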
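
As a quick check of the claim above, the following minimal sketch runs the adjusted pattern over both spellings of the label in one call. The vector name `samples` and its contents are made up for illustration; only base R functions are used.

```r
# Two hypothetical inputs covering both spellings of the label.
samples <- c(
  "IMPRESSION:  text to be extracted  (Dr. Deepak Bhatt)",
  "IMPRESSION :  text to be extracted  (Dr. Deepak Bhatt)"
)

# \\s{0,1} allows zero or one whitespace character before the colon.
pattern <- "IMPRESSION\\s{0,1}:(.*)\\(.*\\)"

# gsub() is vectorised over its input, so both strings are handled at once;
# trimws() then strips the leading and trailing spaces of each capture.
extracted <- trimws(gsub(pattern, "\\1", samples))
print(extracted)
```

Both elements should come back as the same trimmed string. `\\s?` is an equivalent, shorter spelling of `\\s{0,1}`.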

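For the modification requested in the comments below (extra text before and after the part of interest), you also need to adjust the approach by adding .* on both ends so that the whole string is consumed:

    text_between_words3 <- "Some Text before..... IMPRESSION: text to be extracted (Dr. Deepak Bhatt) text that should go too"
    trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*', '\\1', text_between_words3))

Result:

    [1] "text to be extracted"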
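
One caveat with the generic closing group `\\(.*\\)` is worth illustrating: the capture is greedy, so when the text contains more than one parenthesised group (as the sample files in the question do), it runs up to the last one. The sketch below uses a made-up `report` string, loosely modelled on those samples, to show the effect; a more specific closing delimiter avoids it.

```r
# Hypothetical report text with several parenthesised groups.
report <- paste(
  "Findings. IMPRESSION: Enlargement (? inflammatory). No mass lesion.",
  "(Dr. Deepak Bhatt) (B-Scan findings need clinical correlation)"
)

generic <- trimws(gsub(".*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*", "\\1", report))

# The greedy capture only stops before the last "(...)" group, so the
# signature ends up inside the extracted text.
print(generic)
print(grepl("(Dr. Deepak Bhatt)", generic, fixed = TRUE))
```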

If the text only ever contains the one specific name (Dr. Deepak Bhatt), you can also anchor on it directly:

    trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(Dr\\. Deepak Bhatt\\).*', '\\1', text_between_words3))
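
To tie this back to the folder of files in the question, here is a minimal sketch in base R only. It rests on a few observations about the question's code: readFn is called before it is defined, its last expression is the `if` block (so it returns NULL invisibly whenever the first pattern already matched), and `full.names = FALSE` yields paths that only resolve inside the working directory. The helper name `extract_impression`, the temporary demo folder and the shortened file contents are all assumptions made for illustration, not part of the original post.

```r
# Hypothetical helper: return the text between the IMPRESSION label and the
# signature line, or NA when the file does not contain both markers.
extract_impression <- function(path) {
  # Read the file and join its lines into a single string.
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  pattern <- ".*IMPRESSION\\s{0,1}:(.*)\\(Dr\\. Deepak Bhatt\\).*"
  if (!grepl(pattern, txt)) {
    return(NA_character_)
  }
  # Keep the capture group, trim it, and collapse runs of whitespace.
  out <- trimws(gsub(pattern, "\\1", txt))
  gsub("\\s+", " ", out)
}

# Build a small demo folder with two shortened, made-up reports.
demo_dir <- file.path(tempdir(), "impression_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(
  c("The optic nerve is normal.", "", "IMPRESSION:", "",
    "No mass lesion noted.", "No evidence of retinal detachment.", "",
    "(Dr. Deepak Bhatt)", "",
    "(B-Scan findings need to be correlated clinically)"),
  file.path(demo_dir, "15-1-2011.txt")
)
writeLines(
  c("The ciliary body is normal.", "", "IMPRESSION :", "",
    "Marked stretching of the zonules.", "The angle is normal.", "",
    "(Dr. Deepak Bhatt)", "",
    "(UBM findings need to be correlated clinically)"),
  file.path(demo_dir, "1-12-48.txt")
)

# full.names = TRUE so each path can be read from any working directory;
# "\\.txt$" is a regular expression, unlike the glob "*.txt".
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

final_output <- data.frame(
  filename = sub("\\.txt$", "", basename(files)),
  text = vapply(files, extract_impression, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

write.csv(final_output, file.path(demo_dir, "final_output.csv"),
          row.names = FALSE)
print(final_output)
```

For real data, point `list.files` at the actual folder instead of `demo_dir`. Files that lack either marker show up as NA in the text column rather than being silently dropped, which makes them easy to spot.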