1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Other Aliases: ZNF112, ZNF228
Other Designations: zfp-112; zinc finger protein 112; zinc finger protein 228
Chromosome: 19; Location: 19q13.2
Annotation: Chromosome 19NC_000019.9 (44830706..44860856, complement)
ID: 7771
2. SEP15
15 kDa selenoprotein[Homo sapiens]
Chromosome: 1; Location: 1p31
Annotation: Chromosome 1NC_000001.10 (87328128..87380107, complement)
MIM: 606254
ID: 9403
3. MLL4
myeloid/lymphoid or mixed-lineage leukemia 4[Homo sapiens]
Other Aliases: HRX2, KMT2B, MLL2, TRX2, WBP7
Other Designations: KMT2D; WBP-7; WW domain binding protein 7; WW domain-binding protein 7; histone-lysine N-methyltransferase MLL4; lysine N-methyltransferase 2B; lysine N-methyltransferase 2D; mixed lineage leukemia gene homolog 2; myeloid/lymphoid or mixed-lineage leukemia protein 4; trithorax homolog 2; trithorax homologue 2
Chromosome: 19; Location: 19q13.1
Annotation: Chromosome 19NC_000019.9 (36208921..36229779)
MIM: 606834
ID: 9757
37. LOC100509547
hypothetical protein LOC100509547[Homo sapiens]
This record was discontinued.
ID: 100509547
43. LOC100509587
hypothetical protein LOC100509587[Homo sapiens]
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587
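I would like to get the gene name (ZFP112, SEP15, MLL4), the Location field (if present), and the ID field, and skip everything else. All the string utilities (such as scan()) seem geared toward more regular data. The blank lines between records are, in effect, the record separators. I could write the text to disk and read it back with readLines(), but I would prefer to do this from memory, since I am downloading it over HTTP.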
Answer 0 (score: 4)
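Read the data from "myfile.dat" or, if you have already read it in as separate lines, start from L below. Now extract the lines that start with digits followed by a dot and a space, or that contain the word Location:, or that start with ID:. Then remove everything in those lines up to and including the last space, giving v2. Create a grouping vector g that identifies which group each component of v2 belongs to. (We use the fact that the first field of each group starts with a non-digit and the other fields start with a digit.) Then split v2 into those groups, giving s. Extend the short components of s by inserting NA where appropriate, on the assumption that a short component is missing its Location: field. (We assume that the first field and the ID field cannot be missing.) Finally, transpose the result so that the fields are in columns and the cases are in rows.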
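# Read the file as lines (skip this if L already holds the lines).
L <- readLines("myfile.dat")
# Keep record headers ("1. ZFP112"), lines containing "Location: ", and ID lines.
v <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
# Drop everything up to and including the last space.
v2 <- sub(".* ", "", v)
# A value that starts with a non-digit (the gene name) opens a new group.
g <- cumsum(regexpr("^\\D", v2) > 0)
s <- split(v2, g)
# Pad two-field records (no Location) with NA, then put cases in rows.
m <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)
t(m)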
Using the sample data from the post, the last line gives this:
  [,1]           [,2]      [,3]
1 "ZFP112"       "19q13.2" "7771"
2 "SEP15"        "1p31"    "9403"
3 "MLL4"         "19q13.1" "9757"
4 "LOC100509547" NA        "100509547"
5 "LOC100509587" NA        "100509587"
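The question asks for a way to avoid the round trip to disk. One possible approach, sketched here as an assumption and not as part of the original answer, is to wrap the downloaded text in a textConnection() so that readLines() works on it directly. The names txt, con, and res, the shortened sample records, and the column labels are all hypothetical and used only for illustration; txt stands in for whatever character string the HTTP download returns.

```r
# Hypothetical in-memory text standing in for an HTTP response body
# (two shortened records from the post; the second has no Location).
txt <- "1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Chromosome: 19; Location: 19q13.2
ID: 7771

37. LOC100509547
hypothetical protein LOC100509547[Homo sapiens]
This record was discontinued.
ID: 100509547"

# Split the string into lines without writing anything to disk.
con <- textConnection(txt)
L <- readLines(con)
close(con)

# Same steps as in the answer above.
v  <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
v2 <- sub(".* ", "", v)
g  <- cumsum(regexpr("^\\D", v2) > 0)
s  <- split(v2, g)
m  <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)

# Transpose so that cases are rows, and label the columns (labels are illustrative).
res <- t(m)
colnames(res) <- c("gene", "location", "id")
print(res)
```

If the text is already a single string, strsplit(txt, "\n")[[1]] should give the same vector of lines. The same caveats as in the answer still apply: a gene name that starts with a digit, or a record with no ID line, would be grouped incorrectly.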