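Opening a txt file in record format in R

Time: 2014-05-01 20:01:37

Tags: r

I want to read a txt file in record format into R as a data frame, with one row per record. The records vary in length. Any idea how I can do this?

Here is the first record: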

# C. elegans orthologs      
# WormBase version: WS241       
# Generated:        
# File is in record format with records separated by "=\n"      
#      Sample Record        
#      WBGeneID \t PublicName \n        
#      Species \t Ortholog \t MethodsUsedToAssignOrtholog \n        
# BEGIN CONTENTS        
=       
WBGene00000001  aap-1   
Ascaris suum    GS_11030    WormBase-Compara
Brugia malayi   WBGene00227541  WormBase-Compara
Bursephelenchus xylophilus  BUX.s00055.227  WormBase-Compara
Caenorhabditis angaria  Cang_2012_03_13_00205.g6964.t3  WormBase-Compara
Caenorhabditis brenneri WBGene00194098  TreeFam; WormBase-Compara
Caenorhabditis briggsae WBGene00032086  Hillier-set; OrthoMCL; Inparanoid_7; OMA;     WormBase-Compara
Caenorhabditis japonica WBGene00207613  WormBase-Compara
Caenorhabditis remanei  WBGene00069407  Inparanoid_7; OMA; TreeFam; WormBase-Compara
Caenorhabditis sp.11    Csp11.Scaffold542.g3421.t1  WormBase-Compara
Caenorhabditis sp.5 Csp5_scaffold_00676.g14307.t1   WormBase-Compara
Danio rerio ENSEMBL:ENSDARP00000056212  TreeFam
Dirofilaria immitis nDi.2.2.2.t01810    WormBase-Compara
Drosophila melanogaster ENSEMBL:FBpp0303635 EnsEMBL-Compara; TreeFam
Haemonchus contortus    HCOI02027400.t1 WormBase-Compara
Heterorhabditis bacteriophora   Hba_15363   WormBase-Compara
Homo sapiens    ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam
Loa loa EFO26046.2  WormBase-Compara
Meloidogyne hapla   MhA1_Contig1573.frz3.gene15 WormBase-Compara
Mus musculus    ENSEMBL:ENSMUSP00000034296  EnsEMBL-Compara; TreeFam
Onchocerca volvulus WBGene00241206  WormBase-Compara
Panagrellus redivivus   Pan_g2405.t1    WormBase-Compara
Pristionchus pacificus  WBGene00117228  Inparanoid_7; OMA; WormBase-Compara
Trichinella spiralis    EFV56516    WormBase-Compara
=       
WBGene00000002  aat-1   
Ascaris suum    GS_20881    WormBase-Compara

Edit: All I really need from each record is the entry corresponding to "Homo sapiens". So ideally, my df in R would be:

WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam 
WBGene00000002 aat-1 etc etc

2 Answers:

Answer 0: (score: 1)

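I would suggest using readLines() to read the data into R. Since you gave the file path in the comments, first open a connection to the file with file(), then call readLines() on it. It is always good practice to close() the connection once the data has been read and stored in R.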

> con <- file("../Input/c_elegans.PRJNA13758.current.best_blastp_hits.txt", open = "r")
> XX <- readLines(con)
> close(con)
> record <- grep("^WBGene", XX, value = TRUE)
> sapien <- grep("Homo sapiens", XX, value = TRUE, fixed = TRUE)
> gsub("\\s+", " ", paste0(record[1], sapien))
## [1] "WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam"

The entire record vector for the sample data is

> record
## [1] "WBGene00000001  aap-1   " "WBGene00000002  aat-1   "

So when we find the sapien line for record 2, it will be pasted onto record 2, the sapien line for record 3 onto record 3, and so on.

It is worth noting that the OP's data frame was ultimately created with

paste0(record, sapien)
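
One caveat: paste0(record, sapien) lines up correctly only when every record contains exactly one "Homo sapiens" line, because grep() returns just the matching lines and a record without a human ortholog would shift everything after it. Below is a minimal sketch of one way around that, splitting the file on the "=" separators first. It assumes the fields are tab-separated, as the file header describes; the inline lines vector stands in for the output of readLines(), and parse_record and the column names are hypothetical, chosen only for illustration.

```r
# A few lines in the layout of the question's sample (assumed tab-separated).
lines <- c(
  "# BEGIN CONTENTS",
  "=",
  "WBGene00000001\taap-1",
  "Danio rerio\tENSEMBL:ENSDARP00000056212\tTreeFam",
  "Homo sapiens\tENSEMBL:ENSP00000361075\tInparanoid_7; TreeFam",
  "=",
  "WBGene00000002\taat-1",
  "Ascaris suum\tGS_20881\tWormBase-Compara"
)

# Drop header comments, then number the records: each "=" line starts a new one.
body   <- lines[!grepl("^#", lines)]
is_sep <- grepl("^=\\s*$", body)
rec_id <- cumsum(is_sep)[!is_sep]
body   <- body[!is_sep]

# Hypothetical helper: turn one record (a character vector) into one row.
parse_record <- function(rec) {
  gene  <- strsplit(trimws(rec[1]), "\t")[[1]]
  human <- grep("^Homo sapiens\t", rec[-1], value = TRUE)
  # Records without a human ortholog get NA instead of shifting the pairing.
  hit <- if (length(human) > 0) {
    strsplit(human[1], "\t")[[1]]
  } else {
    rep(NA_character_, 3)
  }
  data.frame(gene_id = gene[1], public_name = gene[2],
             species = hit[1], ortholog = hit[2], methods = hit[3],
             stringsAsFactors = FALSE)
}

df <- do.call(rbind, lapply(split(body, rec_id), parse_record))
rownames(df) <- NULL
df
```

Records of any length are handled because the grouping comes from the separators rather than from a fixed line count, and a record with no human entry shows up as a row of NA values rather than being silently paired with the wrong gene.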

Answer 1: (score: 0)

This might also work, using "scan":

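dat <- matrix(unlist(scan(file       = "data",
                          what       = list(""),
                          sep        = "\n",   # one field per physical line
                          skip       = 8,      # skip the 8-line file header
                          multi.line = FALSE)),
              ncol  = 25,   # assumes every record spans exactly 25 lines
              byrow = TRUE)
# In the sample record, column 2 is the gene line and column 18 the Homo sapiens line
paste(dat[, 2], dat[, 18])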

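Each physical line is treated as one field. Each row of dat is a record, and each column is a line. (If needed, each line can be split further on '\t'.)

Finally, I paste together columns 2 and 18, the two columns of interest.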
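
The split on '\t' mentioned above could look roughly like this. It is only a sketch: gene_lines and human_lines are stand-ins for dat[, 2] and dat[, 18], filled with the first record from the question, the fields are assumed to be tab-separated, and the column names are made up for illustration.

```r
# Stand-ins for dat[, 2] and dat[, 18] (values taken from the question's sample).
gene_lines  <- c("WBGene00000001\taap-1\t")
human_lines <- c("Homo sapiens\tENSEMBL:ENSP00000361075\tInparanoid_7; TreeFam")

# strsplit() yields one character vector per line; rbind stacks them as matrix rows.
gene_fields  <- do.call(rbind, strsplit(gene_lines,  "\t", fixed = TRUE))
human_fields <- do.call(rbind, strsplit(human_lines, "\t", fixed = TRUE))

df <- data.frame(gene_id     = gene_fields[, 1],
                 public_name = gene_fields[, 2],
                 species     = human_fields[, 1],
                 ortholog    = human_fields[, 2],
                 methods     = human_fields[, 3],
                 stringsAsFactors = FALSE)
df
```

This gives one column per field instead of a single pasted string, which is closer to the data frame the question asks for. It inherits the fixed 25-line assumption of the matrix approach, so it only applies when all records have the same layout.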