The output of str_split is a list. How can I collapse a list of lists into a flat list?
See the example data below:
library(magrittr)
library(dplyr)
url='https://github.com/macarthur-lab/clinvar/raw/master/output/clinvar.tsv.gz'
w=readr::read_tsv(url) #warnings can be safely ignored
w<-w %>% filter(grepl('LabCorp',all_submitters))
#traits are separated by semicolons
ttd<-stringr::str_split(w$all_traits,pattern = ';')
#there are several traits per row from str_split
ttd.l<-sapply(ttd,length)
#sample
ttd[[77]]
[1] "Hereditary cancer-predisposing syndrome"
[2] "Lynch syndrome"
[3] "Lynch Syndrome"
[4] "Neoplastic Syndromes, Hereditary"
[5] "Hereditary non-polyposis colon cancer"
#how to put all 'all-traits' into single vector
This does not seem to do it:
traits<-lapply(ttd,c)
table(traits)
Edit: the problem with a plain unlist(ttd) is that I need to keep each row's ID from w$measureset_id,
like this:
out=data.frame()
for (i in 1:length(ttd)) {
print(i)
#unlist(ttd[[i]])
one<-data.frame(id=w[i,'measureset_id']
,trait=unique(toupper(unlist(ttd[[i]]))))
out<-rbind(out,one)
}
head(out,5)
measureset_id trait
1 36663 CARDIAC ARRHYTHMIA
2 36663 ARRHYTHMIA
3 12779 PHEOCHROMOCYTOMA/PARAGANGLIOMA
4 12779 PHEOCHROMOCYTOMA
5 12779 PARAGANGLIOMAS 4
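Answer 0 (score: 3)
Your ttd is a list of character vectors. If what you want is a single character vector containing all of the elements (length 3992), then all you need is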
traits <- unlist(ttd)
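As a minimal illustration on made-up data (the toy list below is hypothetical, not taken from ClinVar), unlist concatenates the vectors in order, and table can then count the values in the flat result:

```r
# Toy stand-in for ttd: a list of character vectors (hypothetical data)
toy <- list(c("Lynch syndrome", "Cardiomyopathy"),
            c("Arrhythmia"),
            c("Lynch syndrome"))

# unlist() concatenates the list elements into one character vector
flat <- unlist(toy)
length(flat)  # 4

# Count how often each trait occurs in the flat vector
counts <- table(flat)
```

Note that this drops the link between each trait and the row it came from.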
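Answer 1 (score: 1)
Given your additional information, there are a few ways you could do this. I have picked up your code at the point just before you create ttd, because that step only makes things harder for you.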
library(plyr)
library(dplyr)
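#First, create a helper that returns the unique traits for a subset of rows
#(tibble() replaces the deprecated data_frame(); it is re-exported by dplyr)
getTraits <- function(x) tibble(trait=unique(unlist(strsplit(x$all_traits, split=";"))))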
#Method 1 using plyr
traits <- ddply(w, .(measureset_id), getTraits)
head(traits)
# measureset_id trait
#1 788 Sudden infant death syndrome
#2 788 Brugada syndrome 2
#3 788 Primary familial hypertrophic cardiomyopathy
#4 788 Sudden Infant Death Syndrome
#5 788 Cardiomyopathy
#6 788 Long QT syndrome
traits[traits$measureset_id == 36663, ]
# measureset_id trait
#3231 36663 Cardiac arrhythmia
#3232 36663 Arrhythmia
#Method 2 using dplyr
traitsd <- w %>% group_by(measureset_id) %>% do(getTraits(.))
head(traitsd)
#Source: local data frame [6 x 2]
#Groups: measureset_id [1]
#
# measureset_id trait
# (int) (chr)
#1 788 Sudden infant death syndrome
#2 788 Brugada syndrome 2
#3 788 Primary familial hypertrophic cardiomyopathy
#4 788 Sudden Infant Death Syndrome
#5 788 Cardiomyopathy
#6 788 Long QT syndrome
traitsd[traitsd$measureset_id == 36663, ]
#Source: local data frame [2 x 2]
#Groups: measureset_id [1]
#
# measureset_id trait
# (int) (chr)
#1 36663 Cardiac arrhythmia
#2 36663 Arrhythmia
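If you would rather avoid the per-group function call, a base R approach may also work: repeat each ID once per trait so it lines up with the unlisted vector. This is only a sketch on a hypothetical toy data frame standing in for w; the toupper step follows the loop in the question, not the methods above.

```r
# Hypothetical toy data frame standing in for w
toy <- data.frame(
  measureset_id = c(1L, 2L),
  all_traits = c("Cardiac arrhythmia;Arrhythmia",
                 "Lynch syndrome;Lynch Syndrome"),
  stringsAsFactors = FALSE
)

# Split each row's traits on semicolons (fixed = TRUE treats ";" literally)
parts <- strsplit(toy$all_traits, split = ";", fixed = TRUE)

# Repeat each ID once per trait so it aligns with the flattened vector
long <- data.frame(
  measureset_id = rep(toy$measureset_id, lengths(parts)),
  trait = toupper(unlist(parts)),
  stringsAsFactors = FALSE
)

# Drop duplicate (id, trait) pairs, e.g. "Lynch syndrome" vs "Lynch Syndrome"
long <- unique(long)
```

On the real data you would substitute w for toy; lengths() requires R 3.2.0 or later.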