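I want to read a txt file in record format into R as a data frame, with one row per record. The records vary in length. Any idea how I can do this?
Here is the first record: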
# C. elegans orthologs
# WormBase version: WS241
# Generated:
# File is in record format with records separated by "=\n"
# Sample Record
# WBGeneID \t PublicName \n
# Species \t Ortholog \t MethodsUsedToAssignOrtholog \n
# BEGIN CONTENTS
=
WBGene00000001 aap-1
Ascaris suum GS_11030 WormBase-Compara
Brugia malayi WBGene00227541 WormBase-Compara
Bursephelenchus xylophilus BUX.s00055.227 WormBase-Compara
Caenorhabditis angaria Cang_2012_03_13_00205.g6964.t3 WormBase-Compara
Caenorhabditis brenneri WBGene00194098 TreeFam; WormBase-Compara
Caenorhabditis briggsae WBGene00032086 Hillier-set; OrthoMCL; Inparanoid_7; OMA; WormBase-Compara
Caenorhabditis japonica WBGene00207613 WormBase-Compara
Caenorhabditis remanei WBGene00069407 Inparanoid_7; OMA; TreeFam; WormBase-Compara
Caenorhabditis sp.11 Csp11.Scaffold542.g3421.t1 WormBase-Compara
Caenorhabditis sp.5 Csp5_scaffold_00676.g14307.t1 WormBase-Compara
Danio rerio ENSEMBL:ENSDARP00000056212 TreeFam
Dirofilaria immitis nDi.2.2.2.t01810 WormBase-Compara
Drosophila melanogaster ENSEMBL:FBpp0303635 EnsEMBL-Compara; TreeFam
Haemonchus contortus HCOI02027400.t1 WormBase-Compara
Heterorhabditis bacteriophora Hba_15363 WormBase-Compara
Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam
Loa loa EFO26046.2 WormBase-Compara
Meloidogyne hapla MhA1_Contig1573.frz3.gene15 WormBase-Compara
Mus musculus ENSEMBL:ENSMUSP00000034296 EnsEMBL-Compara; TreeFam
Onchocerca volvulus WBGene00241206 WormBase-Compara
Panagrellus redivivus Pan_g2405.t1 WormBase-Compara
Pristionchus pacificus WBGene00117228 Inparanoid_7; OMA; WormBase-Compara
Trichinella spiralis EFV56516 WormBase-Compara
=
WBGene00000002 aat-1
Ascaris suum GS_20881 WormBase-Compara
Edit: All I really need from each record is the entry corresponding to "Homo sapiens". So, ideally, my df in R would be:
WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam
WBGene00000002 aat-1 etc etc
Answer 0 (score: 1)
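I'd suggest using readLines to read the data into R. Since you gave the file path in the comments, first open a connection to the file with file, then call readLines on it. Once the data has been read and stored in R, it's always good practice to close the connection.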
> con <- file("../Input/c_elegans.PRJNA13758.current.best_blastp_hits.txt",
              open = "r")
> XX <- readLines(con)
> close(con)
> record <- grep("^WBGene", XX, value = TRUE)
> sapien <- grep("Homo sapiens", XX, value = TRUE, fixed = TRUE)
> gsub("\\s+", " ", paste0(record[1], sapien))
## [1] "WBGene00000001 aap-1 Homo sapiens ENSEMBL:ENSP00000361075 Inparanoid_7; TreeFam"
The entire record vector for the sample data is
> record
## [1] "WBGene00000001 aap-1 " "WBGene00000002 aat-1 "
So once the sapien entry for record 2 is found, it will be pasted onto record 2, sapien 3 onto record 3, and so on.
Worth noting: the OP's data frame was ultimately created with paste0(record, sapien).
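One caveat: pairing record and sapien by position only lines up if every record contains exactly one Homo sapiens line. Below is a minimal sketch of a record-by-record alternative. It assumes the fields are tab-separated, as the file header states, and that records start after a line containing only "=". The helper name parse_orthologs and the column names are made up for illustration.

```r
# Hypothetical helper: returns one row per record, with NA where a record
# has no line for the requested species. Assumes tab-separated fields.
parse_orthologs <- function(lines, species = "Homo sapiens") {
  # Drop blank lines and the "#" header comments.
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  # Every "=" line starts a new record, so a running count labels each line.
  id <- cumsum(lines == "=")
  keep <- lines != "="
  records <- split(lines[keep], id[keep])

  rows <- lapply(records, function(rec) {
    gene <- strsplit(rec[1], "\t", fixed = TRUE)[[1]]
    body <- rec[-1]
    # First line for the species, or NA if the record has none.
    hit <- body[startsWith(body, paste0(species, "\t"))][1]
    fields <- strsplit(hit, "\t", fixed = TRUE)[[1]]
    data.frame(gene_id = gene[1], public_name = gene[2],
               ortholog = fields[2], methods = fields[3],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Small in-memory sample built from the lines shown in the question.
sample_lines <- c(
  "# BEGIN CONTENTS",
  "=",
  "WBGene00000001\taap-1",
  "Ascaris suum\tGS_11030\tWormBase-Compara",
  "Homo sapiens\tENSEMBL:ENSP00000361075\tInparanoid_7; TreeFam",
  "=",
  "WBGene00000002\taat-1",
  "Ascaris suum\tGS_20881\tWormBase-Compara"
)
df <- parse_orthologs(sample_lines)
print(df)
```

On the real file, the same function could be called on the XX vector returned by readLines. If a record lists several lines for the species, this sketch keeps only the first.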
Answer 1 (score: 0)
This might also work, using "scan":
dat <- matrix(unlist(scan(file = "data",
                          what = list(""),
                          sep = "\n",
                          skip = 8, # file header
                          multi.line = FALSE)),
              ncol = 25, # one record spans 25 lines
              byrow = TRUE)
paste(dat[,2], dat[,18])
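Each whole line is read as a single field. Each row of dat is one record, and each column is one line of that record. (If needed, each line can be split further on '\t'.)
Finally, I combine columns 2 and 18, the two columns of interest.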
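Note that this layout relies on every record spanning exactly 25 lines (the "=" separator, the gene line, and 23 species lines), which holds for the sample record but not for records of other lengths. If separate columns are wanted instead of one pasted string, the two lines can be split on tabs. A rough sketch, assuming tab-separated fields; gene_lines and human_lines are stand-ins for dat[, 2] and dat[, 18], and the column names are invented for illustration:

```r
# Stand-ins for dat[, 2] and dat[, 18], using the first record from the question.
gene_lines  <- c("WBGene00000001\taap-1")
human_lines <- c("Homo sapiens\tENSEMBL:ENSP00000361075\tInparanoid_7; TreeFam")

# strsplit returns one character vector per line; rbind stacks them into a matrix.
gene_parts  <- do.call(rbind, strsplit(gene_lines, "\t", fixed = TRUE))
human_parts <- do.call(rbind, strsplit(human_lines, "\t", fixed = TRUE))

result <- data.frame(
  gene_id     = gene_parts[, 1],
  public_name = gene_parts[, 2],
  species     = human_parts[, 1],
  ortholog    = human_parts[, 2],
  methods     = human_parts[, 3],
  stringsAsFactors = FALSE
)
print(result)
```

The rbind step only yields a clean matrix when every line splits into the same number of fields.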