Loop through Word/PDF documents and extract specific text into a table in R

Time: 2018-01-05 11:39:23

Tags: r excel pdf ms-word text-mining

I have a folder containing about 150 Word and PDF documents (the same text in both formats). The data is here: http://www.sicgen.pt/antigen_folder/data_sheet/AB0003_ERP57_AB_data_sheet2003.pdf

The text always looks like this (after loading pdftools):

library(pdftools)
u <- pdf_text("AB0003_ERP57_AB_data_sheet200.pdf")

[1] "                                                                     Product Data Sheet\r\n                                                                                      001 Rev1 Jan 2012 by JR\r\nCatalogue No. AB0003-200\r\nQty: 400 µg (2 mg/ml)\r\n                                  ERp57 Polyclonal Antibody\r\nSource: Goat                                               phospholipase C alpha, PI PLC, protein disulfide\r\n                                                           isomerase A3 antibody.\r\nGeneral description: Goat polyclonal to ERp57 -\r\nendoplasmic reticulum lumen marker. This                   Form: Polyclonal antibody supplied as a 200 µl\r\nendoplasmic reticulum protein interacts with lectin        (2 mg/ml) aliquot in PBS, 20% glycerol and 0.05%\r\nchaperones calreticulin and calnexin to modulate           sodium azide. This antibody is epitope-affinity\r\nfolding of newly synthesized glycoproteins. It has         purified from goat antiserum.\r\ndisulfide isomerase activity and complexes of\r\nlectins and this protein mediate protein folding by        Immunogen: Recombinant peptide derived from\r\npromoting formation of disulfide bonds in their            within residues 300 aa to the C-terminus of human\r\nglycoprotein substrates.                                   ERp57 produced in E. 
coli.\r\nAlternative names: 58 kDa glucose regulated                Specificity: Detects a band of 60 kDa by Western\r\nprotein, 58 kDa microsomal protein, disulfide              blot in the following canine, human, monkey,\r\nisomerase ER 60, endoplasmic reticulum resident            mouse, rat whole cell lysates.\r\nprotein 57, endoplasmic reticulum resident protein\r\n60, ER protein 57, ER protein 60, ER protein 61,\r\nERP57, ERp60, ERp61, glucose regulated protein\r\n58 Kd, GRP57, GRP58, HsT17083, P58, PDIA3,\r\nReactivity: Reacts against human, rat, mouse, canine and monkey proteins.\r\nSample                Western blot      Immuno-        Histochemistry (paraffin)     Histochemistry (frozen)\r\n                                        fluorescence\r\nhuman                 +++               +++            +++                           +++\r\nrat                   +++               +++            +++                           +++\r\nmouse                 +++               +++            +++                           +++\r\ncanine                +++               +++            +++                           +++\r\nmonkey                +++               +++            +++                           +++\r\n+++ excellent, ++ good, + poor, ND not determined\r\nUsage: Western blot                    1:500-1:2,000       Storage: Store at -20 C for long-term storage. Store\r\nImmunofluorescence                        1:50-1:500       at 2-8 C for up to one month.\r\nImmunohistochemistry (paraffin)        1:200-1:1,000\r\nImmunohistochemistry (frozen)          1:200-1:1,000       Special instructions: Avoid freeze/thaw cycles.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt                                                                           information@sicgen.pt\r\n"
[2] "                                                                          Product Data Sheet\r\n                                                                                             001 Rev1 Jan 2012 by JR\r\nReferences:\r\n                                    For research use only, not for diagnostic use\r\nSICGEN's Proprietary Immunogen Policy\r\nIn order to produce high specific antibodies SICGEN has invested a lot of time and effort into selecting immunogen\r\nsequences. SICGEN has decided to protect this information by not publishing it on the website. However, these sequences\r\nare available on request.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt                                                                                  information@sicgen.pt\r\n"

I would like to convert this into a data frame or table in R or Excel.

 Catalogue.No.  Name Source.
1    AB0003-200 ERp57    Goat
2    AB0004-500 (...)   (...)
                                                                                                  General.Description
1 Goat polyclonal to ERp57 -  endoplasmic reticulum lumen marker.  This endoplasmic reticulum protein interacts (...)
2                                                                                                               (...)
                        Alternative.names.
1 58 kDa glucose  regulated protein, (...)
2                                    (...)
                                                               Form.
1 Polyclonal antibody supplied as a  200 µl (2 mg/ml) aliquot in PBS
2                                                              (...)
                                                       Immunogen
1 Recombinant peptide derived  from within residues 300 aa (...)
2                                                          (...)
                       Specificity.                     Reactivity.
1 Detects a band of  60 kDa by(...) Reacts against  human, rat, ...
2                             (...)                           (...)
                                         Usage.
1 Western blot 1:500-1:2,000 Immunofluorescence
2                                         (...)

I would like to format this as a table. The text was imported from the PDF file.

textImport <- pdf_text("AB0003_ERP57_AB_data_sheet200.pdf")

If you have any suggestions, please let me know.

1 answer:

Answer 0: (score: 0)

I can't post code in a comment, so here is a possible approach using pdftools and regular expressions.

DATA

I used the same data you provided and saved it to a PDF named "pdf_catalogue.pdf".

CODE

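library(pdftools)

# pdf_text() returns a character vector with one element per page
u <- pdf_text("pdf_catalogue.pdf")

# Return the text captured by the (.*) group in `pattern`, with line breaks removed.
# Only the match from the first element of `string` (the first page) is used.
get_string <- function(pattern, string){
  inter_list <- regmatches(string, regexec(pattern, string))
  if(length(inter_list) > 0){

    replace_patterns_list <- list("\r", "\n") #add others as required
    replace_patterns <- paste(unlist(replace_patterns_list), collapse = "|")

    # element 1 of a regexec match is the full match, element 2 the captured group
    inter_string <- gsub(replace_patterns, "", inter_list[[1]][2])
    return(inter_string)
  }

}

# Each pattern captures the text between a field label and the label of the next field
pat_source <- "Source: (.*)General description"
pat_description <- "General description: (.*)Alternative"
pat_form <- "Form: (.*)Immunogen"
pat_names <- "Alternative names: (.*)Form"

# Form must use pat_form; with pat_source it just repeats the Source value
# ("Goat"), which is what the $Form entry in the output printed below shows
dat <- list(Source = get_string(pat_source, u),
        General_description = get_string(pat_description, u),
        Form = get_string(pat_form, u),
        Alternative_names = get_string(pat_names, u))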

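The get_string function returns whatever lies between the strings before and after (.*). This relies on the assumption, implied by your question, that the file structure is consistent. If needed, you can use (.*?) for a "lazy" (non-greedy) match, which stops at the first occurrence of the closing string rather than the last. If you are not familiar with regular expressions, Roger Peng has an excellent video explaining them.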
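To make the greedy versus lazy difference concrete, here is a minimal sketch. The string `txt` is made up for illustration (it is not taken from the data sheet) and deliberately contains the closing marker "Form" twice.

```r
# Toy string (hypothetical) in which the closing marker " Form" appears twice
txt <- "Alternative names: GRP57, GRP58 Form: liquid Form: frozen"

# Greedy: (.*) runs up to the LAST " Form"
greedy <- regmatches(txt, regexec("Alternative names: (.*) Form", txt))[[1]][2]

# Lazy: (.*?) stops at the FIRST " Form"
lazy <- regmatches(txt, regexec("Alternative names: (.*?) Form", txt))[[1]][2]

greedy  # "GRP57, GRP58 Form: liquid"
lazy    # "GRP57, GRP58"
```

If a label such as "Form" can also occur inside the body text of a field, the lazy version is usually the safer choice.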

OUTPUT

> dat
$Source
[1] "Goat"

$General_description
[1] "Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.This endoplasmic reticulum protein interacts with lectin chaperones calreticulin andcalnexin to modulate folding of newly synthesized glycoproteins. It has disulfideisomerase activity and complexes of lectins and this protein mediate protein folding bypromoting formation of disulfide bonds in their glycoprotein substrates."

$Form
[1] "Goat"

$Alternative_names
[1] "58 kDa glucose regulated protein, 58 kDa microsomal protein,disulfide isomerase ER 60, endoplasmic reticulum resident protein 57, endoplasmicreticulum resident protein 60, ER protein 57, ER protein 60, ER protein 61, ERP57,ERp60, ERp61, glucose regulated protein 58 Kd, GRP57, GRP58, HsT17083, P58,PDIA3, phospholipase C alpha, PI PLC, protein disulfide isomerase A3 antibody."

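You may want to split the output further depending on its structure. For example, in Alternative names, the names all appear to be comma-separated. You can try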

> strsplit(dat$Alternative_names, ", ")

which gives

[[1]]
 [1] "58 kDa glucose regulated protein"                   
 [2] "58 kDa microsomal protein,disulfide isomerase ER 60"
 [3] "endoplasmic reticulum resident protein 57"          
 [4] "endoplasmicreticulum resident protein 60"           
 [5] "ER protein 57"                                      
 [6] "ER protein 60"                                      
 [7] "ER protein 61"                                      
 [8] "ERP57,ERp60"                                        
 [9] "ERp61"                                              
[10] "glucose regulated protein 58 Kd"                    
[11] "GRP57"                                              
[12] "GRP58"                                              
[13] "HsT17083"                                           
[14] "P58,PDIA3"                                          
[15] "phospholipase C alpha"                              
[16] "PI PLC"                                             
[17] "protein disulfide isomerase A3 antibody." 

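Note that splitting on a comma followed by a space (", ") leaves some elements, such as the second, holding two names. You would need to split on "," alone to avoid such errors. This is especially important with .pdf files, where removing the line breaks can leave a comma with no space after it. You can also easily split multi-line text into separate fields by defining the break appropriately (a period followed by a capital letter). Regular expressions should let you handle all of these use cases.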
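Both splitting ideas can be sketched as follows. The strings `names_raw` and `desc` are shortened, made-up stand-ins for the extracted fields, and the sentence split is only a heuristic: it would also break after an abbreviation that is followed by a capital letter.

```r
# Shortened stand-in for dat$Alternative_names (some commas have no space after them)
names_raw <- "58 kDa microsomal protein,disulfide isomerase ER 60, ERP57,ERp60, P58,PDIA3"

# Split on a comma plus any amount of whitespace, then trim what is left
alt_names <- trimws(strsplit(names_raw, ",\\s*")[[1]])

# Shortened stand-in for dat$General_description (note "marker.This" with no space)
desc <- "Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.This protein interacts with lectin chaperones. It has disulfide isomerase activity."

# Mark every "period followed by a capital letter" with a newline, then split there
marked <- gsub("\\.\\s*([A-Z])", ".\n\\1", desc)
sentences <- strsplit(marked, "\n", fixed = TRUE)[[1]]

alt_names
sentences
```

Inserting a marker with gsub and then splitting on it avoids relying on how strsplit treats zero-length matches.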

This is a fairly minimal example, but you can easily build on it to cover any other fields/combinations you may need.

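For multiple files, I would suggest wrapping all of this in a function (once the code is finalized) and using lapply to loop over the directory. I use something similar to go through .txt and .csv files.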
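One possible shape for that wrapper is sketched below. The names `extract_fields` and `parse_folder` and the `read_fun` argument are hypothetical, introduced only for this illustration. Unlike get_string above, this sketch joins all pages into one string and replaces line breaks with a space instead of deleting them. It also assumes every file follows the same label order as the patterns.

```r
# Field name -> regex with one capture group (patterns taken from the answer above)
patterns <- c(
  Source              = "Source: (.*)General description",
  General_description = "General description: (.*)Alternative",
  Alternative_names   = "Alternative names: (.*)Form",
  Form                = "Form: (.*)Immunogen"
)

# Extract every field from one document's text; NA when a pattern does not match
extract_fields <- function(text, patterns) {
  text <- paste(text, collapse = "\n")  # e.g. pdf_text() gives one string per page
  vapply(patterns, function(p) {
    m <- regmatches(text, regexec(p, text))[[1]]
    if (length(m) < 2) NA_character_ else trimws(gsub("[\r\n]+", " ", m[2]))
  }, character(1))
}

# read_fun turns a file path into text, e.g. pdftools::pdf_text for PDFs
parse_folder <- function(dir, read_fun, patterns, file_pattern = "\\.pdf$") {
  files <- list.files(dir, pattern = file_pattern, full.names = TRUE,
                      ignore.case = TRUE)
  if (length(files) == 0) return(data.frame())
  rows <- lapply(files, function(f) extract_fields(read_fun(f), patterns))
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  cbind(file = basename(files), out, stringsAsFactors = FALSE)
}

# Possible usage (the folder path is a placeholder):
# catalogue <- parse_folder("path/to/folder", pdftools::pdf_text, patterns)
# write.csv(catalogue, "catalogue.csv", row.names = FALSE)  # opens in Excel
```

Passing the reader as an argument keeps the extraction logic independent of the file type, so the Word documents could be handled by the same function with a different reader.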

Hope this helps. Cheers!