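I have a folder with about 150 Word and PDF documents (the same text in both formats). Sample data is here: http://www.sicgen.pt/antigen_folder/data_sheet/AB0003_ERP57_AB_data_sheet2003.pdf
After loading pdftools, the text always looks like this: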
library(pdftools)
u <- pdf_text("AB0003_ERP57_AB_data_sheet200.pdf")
[1] " Product Data Sheet\r\n 001 Rev1 Jan 2012 by JR\r\nCatalogue No. AB0003-200\r\nQty: 400 µg (2 mg/ml)\r\n ERp57 Polyclonal Antibody\r\nSource: Goat phospholipase C alpha, PI PLC, protein disulfide\r\n isomerase A3 antibody.\r\nGeneral description: Goat polyclonal to ERp57 -\r\nendoplasmic reticulum lumen marker. This Form: Polyclonal antibody supplied as a 200 µl\r\nendoplasmic reticulum protein interacts with lectin (2 mg/ml) aliquot in PBS, 20% glycerol and 0.05%\r\nchaperones calreticulin and calnexin to modulate sodium azide. This antibody is epitope-affinity\r\nfolding of newly synthesized glycoproteins. It has purified from goat antiserum.\r\ndisulfide isomerase activity and complexes of\r\nlectins and this protein mediate protein folding by Immunogen: Recombinant peptide derived from\r\npromoting formation of disulfide bonds in their within residues 300 aa to the C-terminus of human\r\nglycoprotein substrates. ERp57 produced in E. coli.\r\nAlternative names: 58 kDa glucose regulated Specificity: Detects a band of 60 kDa by Western\r\nprotein, 58 kDa microsomal protein, disulfide blot in the following canine, human, monkey,\r\nisomerase ER 60, endoplasmic reticulum resident mouse, rat whole cell lysates.\r\nprotein 57, endoplasmic reticulum resident protein\r\n60, ER protein 57, ER protein 60, ER protein 61,\r\nERP57, ERp60, ERp61, glucose regulated protein\r\n58 Kd, GRP57, GRP58, HsT17083, P58, PDIA3,\r\nReactivity: Reacts against human, rat, mouse, canine and monkey proteins.\r\nSample Western blot Immuno- Histochemistry (paraffin) Histochemistry (frozen)\r\n fluorescence\r\nhuman +++ +++ +++ +++\r\nrat +++ +++ +++ +++\r\nmouse +++ +++ +++ +++\r\ncanine +++ +++ +++ +++\r\nmonkey +++ +++ +++ +++\r\n+++ excellent, ++ good, + poor, ND not determined\r\nUsage: Western blot 1:500-1:2,000 Storage: Store at -20 C for long-term storage. 
Store\r\nImmunofluorescence 1:50-1:500 at 2-8 C for up to one month.\r\nImmunohistochemistry (paraffin) 1:200-1:1,000\r\nImmunohistochemistry (frozen) 1:200-1:1,000 Special instructions: Avoid freeze/thaw cycles.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt information@sicgen.pt\r\n"
[2] " Product Data Sheet\r\n 001 Rev1 Jan 2012 by JR\r\nReferences:\r\n For research use only, not for diagnostic use\r\nSICGEN's Proprietary Immunogen Policy\r\nIn order to produce high specific antibodies SICGEN has invested a lot of time and effort into selecting immunogen\r\nsequences. SICGEN has decided to protect this information by not publishing it on the website. However, these sequences\r\nare available on request.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt information@sicgen.pt\r\n"
I would like to convert this into a data frame or table in R or Excel, like this:
Catalogue.No. Name Source.
1 AB0003-200 ERp57 Goat
2 AB0004-500 (...) (...)
General.Description
1 Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker. This endoplasmic reticulum protein interacts (...)
2 (...)
Alternative.names.
1 58 kDa glucose regulated protein, (...)
2 (...)
Form.
1 Polyclonal antibody supplied as a 200 µl (2 mg/ml) aliquot in PBS
2 (...)
Immunogen
1 Recombinant peptide derived from within residues 300 aa (...)
2 (...)
Specificity. Reactivity.
1 Detects a band of 60 kDa by(...) Reacts against human, rat, ...
2 (...) (...)
Usage.
1 Western blot 1:500-1:2,000 Immunofluorescence
2 (...)
I want to put it into tabular form. The text is imported from the PDF file with:
textImport <- pdf_text("AB0003_ERP57_AB_data_sheet200.pdf")
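If you have any suggestions, please let me know.
Answer 0 (score: 0)
I couldn't post code in a comment, so here is a possible approach using pdftools and regular expressions.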
Data
I used the same data you provided and saved it to a PDF named "pdf_catalogue.pdf".
Code
library(pdftools)
u <- pdf_text("pdf_catalogue.pdf")
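# Return the text captured by the first (...) group of pattern,
# with carriage returns and line feeds stripped.
get_string <- function(pattern, string){
  inter_list <- regmatches(string, regexec(pattern, string))
  if(length(inter_list) > 0){
    replace_patterns_list <- list("\r", "\n") # add others as required
    replace_patterns <- paste(unlist(replace_patterns_list), collapse = "|")
    # [[1]] is the first page; element 2 is the captured group (NA if no match)
    inter_string <- gsub(replace_patterns, "", inter_list[[1]][2])
    return(inter_string)
  }
}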
pat_source <- "Source: (.*)General description"
pat_description <- "General description: (.*)Alternative"
pat_form <- "Form: (.*)Immunogen"
pat_names <- "Alternative names: (.*)Form"
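dat <- list(Source = get_string(pat_source, u),
            General_description = get_string(pat_description, u),
            Form = get_string(pat_form, u),
            Alternative_names = get_string(pat_names, u))
The get_string function returns whatever lies between the strings before and after (.*). This relies on the assumption, implied by your question, that the file structure is consistent. Because (.*) is greedy and runs to the last occurrence of the closing string, you may need (.*?) for a "lazy" match that stops at the first one. If you are not familiar with regular expressions, Roger Peng has an excellent video explaining them.
Output
(The $Form value below comes from a run that passed pat_source for Form by mistake, so it repeats the Source; with pat_form as above it returns the text between "Form: " and "Immunogen".)
> dat
$Source
[1] "Goat"
$General_description
[1] "Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.This endoplasmic reticulum protein interacts with lectin chaperones calreticulin andcalnexin to modulate folding of newly synthesized glycoproteins. It has disulfideisomerase activity and complexes of lectins and this protein mediate protein folding bypromoting formation of disulfide bonds in their glycoprotein substrates."
$Form
[1] "Goat"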
$Alternative_names
[1] "58 kDa glucose regulated protein, 58 kDa microsomal protein,disulfide isomerase ER 60, endoplasmic reticulum resident protein 57, endoplasmicreticulum resident protein 60, ER protein 57, ER protein 60, ER protein 61, ERP57,ERp60, ERp61, glucose regulated protein 58 Kd, GRP57, GRP58, HsT17083, P58,PDIA3, phospholipase C alpha, PI PLC, protein disulfide isomerase A3 antibody."
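To make the greedy versus lazy distinction mentioned earlier concrete, here is a minimal sketch. The string txt and the helper first_group are made up for illustration and are not taken from the data sheet; perl = TRUE is used only so that the (.*?) syntax is interpreted by PCRE.

```r
# Toy string (not from the data sheet) where the closing word occurs twice.
txt <- "Form: liquid Immunogen: peptide A. Form: powder Immunogen: peptide B."

# Hypothetical helper returning the first capture group; perl = TRUE selects PCRE.
first_group <- function(pattern, text) {
  regmatches(text, regexec(pattern, text, perl = TRUE))[[1]][2]
}

greedy <- first_group("Form: (.*) Immunogen", txt)   # runs to the LAST " Immunogen"
lazy   <- first_group("Form: (.*?) Immunogen", txt)  # stops at the FIRST " Immunogen"

print(greedy)
print(lazy)
```

If a label such as "Form" can appear more than once in a document, the lazy version is usually the safer choice.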
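You may want to split the output further depending on its structure. For example, the names in Alternative names all appear to be separated by commas. You could try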
> strsplit(dat$Alternative_names, ", ")
which gives
[[1]]
[1] "58 kDa glucose regulated protein"
[2] "58 kDa microsomal protein,disulfide isomerase ER 60"
[3] "endoplasmic reticulum resident protein 57"
[4] "endoplasmicreticulum resident protein 60"
[5] "ER protein 57"
[6] "ER protein 60"
[7] "ER protein 61"
[8] "ERP57,ERp60"
[9] "ERp61"
[10] "glucose regulated protein 58 Kd"
[11] "GRP57"
[12] "GRP58"
[13] "HsT17083"
[14] "P58,PDIA3"
[15] "phospholipase C alpha"
[16] "PI PLC"
[17] "protein disulfide isomerase A3 antibody."
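Note that splitting on a comma followed by a space (", ") leaves two names in the second element, because the line break that separated them was removed without leaving a space. Split on the comma alone (",") to avoid this kind of error. This matters particularly for .pdf files. You can also easily divide multi-line text into separate fields by defining the breaks appropriately (for example, a period followed by a capital letter). Regular expressions should let you handle all of these use cases.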
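As a rough sketch of both ideas, the following splits on a comma with optional trailing whitespace and breaks a run-together description at "period followed by capital letter". The inputs alt_names and desc are shortened toy strings written for this example, not actual output.

```r
# Toy input imitating the extracted text, with inconsistent spacing after commas.
alt_names <- "58 kDa microsomal protein,disulfide isomerase ER 60, ERP57,ERp60, P58,PDIA3"

# Split on a comma plus any whitespace that follows it.
names_vec <- strsplit(alt_names, ",\\s*", perl = TRUE)[[1]]
print(names_vec)

# Toy description where a line break was removed without leaving a space.
desc <- "lumen marker.This protein aids folding. It has isomerase activity."

# Put a newline after each period that is followed by a capital letter,
# then split on that newline.
marked <- gsub("\\.\\s*([A-Z])", ".\n\\1", desc, perl = TRUE)
sentences <- strsplit(marked, "\n", fixed = TRUE)[[1]]
print(sentences)
```

The comma rule would also cut numbers such as 1:2,000, so it is probably best applied only to fields like Alternative names where commas are pure separators.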
This is a fairly minimal example, but you can easily build on it to cover the other fields and combinations you may need.
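For multiple files, I would suggest wrapping all of this in a function (once the code is finalized) and using lapply to loop over the files in the directory. I use something similar to go through .txt and .csv files.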
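One way the multi-file step might look is sketched below. The names extract_field, parse_sheet, read_sheet, the folder "data_sheets" and the output file "catalogue.csv" are all hypothetical. The patterns assume the fields appear in the extracted text in the order Source, General description, Alternative names, Form, Immunogen, as in my test PDF. The two-column layout in the original sheets may interleave fields, so the patterns would likely need adjusting.

```r
# Pull one field out of flattened text; NA when the pattern does not match.
extract_field <- function(pattern, text) {
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_character_)
  trimws(m[2])
}

# Turn the text of one data sheet into a one-row data frame.
# Assumption: the fields appear in this order in the extracted text.
parse_sheet <- function(text) {
  flat <- gsub("[\r\n]+", " ", text)  # join lines so patterns can span them
  data.frame(
    Catalogue_No        = extract_field("Catalogue No\\.\\s*(\\S+)", flat),
    Source              = extract_field("Source:\\s*(.*?)\\s*General description:", flat),
    General_description = extract_field("General description:\\s*(.*?)\\s*Alternative names:", flat),
    Alternative_names   = extract_field("Alternative names:\\s*(.*?)\\s*Form:", flat),
    Form                = extract_field("Form:\\s*(.*?)\\s*Immunogen:", flat),
    stringsAsFactors = FALSE
  )
}

# Read one PDF (all pages joined) and parse it; requires the pdftools package.
read_sheet <- function(path) {
  parse_sheet(paste(pdftools::pdf_text(path), collapse = "\n"))
}

# "data_sheets" is a hypothetical folder name; adjust to your own directory.
files <- list.files("data_sheets", pattern = "\\.pdf$", full.names = TRUE)
if (length(files) > 0) {
  catalogue <- do.call(rbind, lapply(files, read_sheet))
  write.csv(catalogue, "catalogue.csv", row.names = FALSE)  # can be opened in Excel
}
```

Fields that fail to match come back as NA rather than stopping the loop, which makes it easier to spot the sheets whose layout differs from the rest.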
Hope this helps. Cheers!