R: How do I collapse a list of lists (from str_split) into a single list while keeping some row data?

Asked: 2016-01-05 17:16:07

Tags: r dplyr plyr stringr

The output of str_split is a list with one character vector per input string. How can I collapse this list of lists into a flat list?

See the following sample data:

library(magrittr)   
library(dplyr)


url='https://github.com/macarthur-lab/clinvar/raw/master/output/clinvar.tsv.gz'
w=readr::read_tsv(url) #warnings can be safely ignored
w<-w %>% filter(grepl('LabCorp',all_submitters))
#traits are separated by semicolons
ttd<-stringr::str_split(w$all_traits,pattern = ';')
#there are several traits per row from str_split
ttd.l<-sapply(ttd,length)
#sample
ttd[[77]]
[1] "Hereditary cancer-predisposing syndrome"
[2] "Lynch syndrome"                         
[3] "Lynch Syndrome"                         
[4] "Neoplastic Syndromes, Hereditary"       
[5] "Hereditary non-polyposis colon cancer"  
#how to put all 'all-traits' into single vector

This does not seem to do it:

traits<-lapply(ttd,c)
table(traits)
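
A likely reason, sketched on made-up data rather than the ClinVar file: c() called on a single vector returns that vector unchanged, so lapply(ttd, c) hands back a list of the same shape instead of one flat vector. The names x, same and flat below are illustrative only.

```r
# Toy list standing in for ttd (values are made up).
x <- list(c("a", "b"), "c")

# c() applied to each element separately changes nothing.
same <- lapply(x, c)
identical(same, x)

# Passing all elements to a single c() call does flatten the list.
flat <- do.call(c, x)
```

For an unnamed list of character vectors like this one, do.call(c, x) and unlist(x) give the same flat vector.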
Edit: the problem with a plain unlist(ttd) is that I need to keep each row's ID from w$measureset_id.

Something like this:

out=data.frame()
for (i in 1:length(ttd)) {
  print(i)
  #unlist(ttd[[i]])
  one<-data.frame(id=w[i,'measureset_id']
                   ,trait=unique(toupper(unlist(ttd[[i]]))))
  out<-rbind(out,one)
}

head(out,5)

  measureset_id                          trait
1         36663             CARDIAC ARRHYTHMIA
2         36663                     ARRHYTHMIA
3         12779 PHEOCHROMOCYTOMA/PARAGANGLIOMA
4         12779               PHEOCHROMOCYTOMA
5         12779               PARAGANGLIOMAS 4

2 Answers:

Answer 0 (score: 3)

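Your ttd is a list of character vectors. If what you want is a single character vector containing all of the elements (length 3992), then all you need is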

traits <- unlist(ttd)
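
As a small illustration on made-up data (ttd_toy, traits_toy and counts are hypothetical names, not part of the original code), unlist() concatenates the elements in order, so the result has as many entries as all the list elements combined. Upper-casing before table() is an optional step, borrowed from the question's toupper() call, that merges spellings differing only in case.

```r
# Toy stand-in for ttd: one character vector per row (values are made up).
ttd_toy <- list(
  c("Lynch syndrome", "Lynch Syndrome"),
  "Cardiomyopathy",
  c("Arrhythmia", "Lynch syndrome")
)

# Flatten into one character vector.
traits_toy <- unlist(ttd_toy)
length(traits_toy)  # same as sum(lengths(ttd_toy))

# Tabulate traits, ignoring case differences.
counts <- table(toupper(traits_toy))
```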

Answer 1 (score: 1)

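Given your additional information, there are a few ways you can do this. I step into your code just before the point where you create ttd, because building that list separately only makes things harder for yourself.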

library(plyr)
library(dplyr)

#First, create a useful function
getTraits <- function(x) data_frame(trait=unique(unlist(strsplit(x$all_traits, split=";"))))

#Method 1 using plyr
traits <- ddply(w, .(measureset_id), getTraits)
head(traits)
#  measureset_id                                        trait
#1           788                 Sudden infant death syndrome
#2           788                           Brugada syndrome 2
#3           788 Primary familial hypertrophic cardiomyopathy
#4           788                 Sudden Infant Death Syndrome
#5           788                               Cardiomyopathy
#6           788                             Long QT syndrome
traits[traits$measureset_id == 36663, ]
#     measureset_id              trait
#3231         36663 Cardiac arrhythmia
#3232         36663         Arrhythmia

#Method 2 using dplyr
traitsd <- w %>% group_by(measureset_id) %>% do(getTraits(.))
head(traitsd)
#Source: local data frame [6 x 2]
#Groups: measureset_id [1]
#
#  measureset_id                                        trait
#          (int)                                        (chr)
#1           788                 Sudden infant death syndrome
#2           788                           Brugada syndrome 2
#3           788 Primary familial hypertrophic cardiomyopathy
#4           788                 Sudden Infant Death Syndrome
#5           788                               Cardiomyopathy
#6           788                             Long QT syndrome
traitsd[traitsd$measureset_id == 36663, ]
#Source: local data frame [2 x 2]
#Groups: measureset_id [1]
#
#  measureset_id              trait
#          (int)              (chr)
#1         36663 Cardiac arrhythmia
#2         36663         Arrhythmia
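
For comparison, here is a base R sketch that avoids both plyr and dplyr and replaces the question's rbind loop with vectorised calls. It uses a made-up data frame, w_toy, whose column names mirror the question, and it assumes each measureset_id appears on exactly one row. The idea is to repeat each ID once per trait in its row and flatten the traits alongside.

```r
# Toy stand-in for w; column names mirror the question, values are made up.
w_toy <- data.frame(
  measureset_id = c(101L, 102L, 103L),
  all_traits = c("Cardiac arrhythmia;Arrhythmia",
                 "Lynch syndrome;Lynch Syndrome;Hereditary cancer",
                 "Cardiomyopathy"),
  stringsAsFactors = FALSE
)

# Split each row's traits on ";" (fixed = TRUE treats the pattern literally).
parts <- strsplit(w_toy$all_traits, ";", fixed = TRUE)

# Repeat each id once per trait in its row, then flatten the traits.
long <- data.frame(
  measureset_id = rep(w_toy$measureset_id, lengths(parts)),
  trait = toupper(unlist(parts)),
  stringsAsFactors = FALSE
)

# Drop duplicate id/trait pairs, matching unique(toupper(...)) in the loop
# under the one-row-per-id assumption stated above.
long <- unique(long)
rownames(long) <- NULL
```

Because nothing is grown row by row, this avoids the repeated copying that rbind inside a loop incurs. Remove the toupper() call to keep the original capitalisation, as the plyr and dplyr versions above do.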