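I have next-generation sequencing data for several patients (Patient1, Patient2, Patient3, ...).
The patient samples may come from the same disease or from different diseases. We know that some mutations occur more frequently in certain diseases; some variants are known to cause disease, while others are associated with a disease without our really knowing how they contribute to it. I am looking for a way to cluster these patients based on their altered genes, to see whether they share any common signature. A gene may carry several different alterations (e.g. NRAS G12D vs. NRAS G13D vs. NRAS Q61K ...). The order of the alterations within a given patient does not matter. A typical patient has about 500 alterations, and there are about 100 patients.
I checked previous posts, but those questions are about clustering the strings that make up a single list, not about clustering across multiple lists of strings.
Thanks for your help.
The data for one patient looks like this: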
#Patient1
chromosome <- c("X", "7", "10", "1", "X", "5", "5", "X", "10", "7")
position <- c("70360589","128829066","89692923","11206853","70360680","176637576","176637471","70360648","89692913","148543694")
reference <- c("AGC","A","G","AC","GCA","T","G","CAG","G","AA")
alter <- c("","G","A","","","C","A","","A","")
gene <- c("MED12","SMO","PTEN","MTOR","MED12","NSD1","NSD1","MED12","PTEN","EZH2")
cdot <- c("c.6165_6167delGCA","c.74A>G","c.407G>A","c.4571-6_4571-5delGT","c.6256_6258delCAG","c.2176T>C","c.2071G>A","c.6226_6228delCAG","c.397G>A","c.118-5_118-4delTT")
pdot <- c("Q2076del","D25G","C136Y"," ","Q2086del","S726P","A691T","Q2076del","V133I"," ")
patient1 <- data.frame(chromosome, position, reference, alter, gene, cdot, pdot)
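Mutations can be represented in different ways, e.g. gene plus cdot, gene plus gdot, chromosome plus ref and alter, and so on. For me the most convenient is gene & pdot, because it is the most informative: it tells me which gene is altered and what the alteration is (for example, PTEN is the gene, and C25G means that the reference amino acid "C" at position 25 was changed to amino acid "G").
I tried concatenating each gene & pdot pair into a single string, so if a patient has 10 alterations, as in the data frame above, I end up with 10 strings. I would do this for every patient and then cluster all patients based on their alterations. My question is: what is the best way to cluster multiple patients in this setting?
Two more patients: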
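As a minimal sketch of the concatenation step just described, the snippet below builds one label per alteration. It uses a three-row toy subset of patient1; the helper name `make_labels`, the `_` separator, and the choice to fall back to the bare gene name when pdot is empty are all assumptions made for illustration.

```r
# Toy subset of patient1 from the question (three rows only).
patient1 <- data.frame(
  gene = c("MED12", "SMO", "MTOR"),
  pdot = c("Q2076del", "D25G", " "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one "GENE_pdot" label per alteration.
# Rows without a protein-level change fall back to the gene name alone.
make_labels <- function(df) {
  pdot <- trimws(df$pdot)  # treats "" and " " the same way
  labels <- ifelse(pdot == "", df$gene, paste(df$gene, pdot, sep = "_"))
  unique(labels)           # order and duplicates do not matter
}

labels1 <- make_labels(patient1)
print(labels1)
```

Each patient then becomes an unordered set of labels, which is the form that set-based distances or one-hot encodings expect.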
#Patient2
chromosome <- c("X","6","1","1","6","12","5","X","1","10")
position <- c("47424495","157100024","78429978","242023898","30858801","49427266","176637576","70360648","78435702","89692913")
reference <- c("A","GGA","T","A","C","TGC","T","CAG","AA","G")
alter <- c("","","","G","","","C","","","A")
gene <- c("ARAF","ARID1B","FUBP1","EXO1","DDR1","KMT2D","NSD1","MED12","FUBP1","PTEN")
cdot <- c("c.416delA","c.983_985delGAG","c.901delA","c.836A>G","c.474delC","c.11220_11222delGCA","c.2176T>C","c.6226_6228delCAG","c.121-4_121-3delTT","c.397G>A")
pdot <- c("K139fs","G328del","I301fs","N279S","M159fs","Q3745del","S726P","Q2076del","","V133I")
patient2 <- data.frame(chromosome, position, reference, alter, gene, cdot, pdot)
#Patient3
chromosome <- c("1","2","11","14","14","12","2","19","12","17","X","1","10")
position <- c("120539781","141259448","64572018","35871217","102551161","49426952","29416366","18273047","49426730","29490295","70360648","78435702","89692913")
reference <- c("G","A","T","G","TCT","C","G","T","GCT","G","CAG","AA","G")
alter <- c("A","","C","A","","T","C","C","","A","","","A")
gene <- c("NOTCH2","LRP1B","MEN1","NFKBIA","HSP90AA1","KMT2D","ALK","PIK3R2","KMT2D","NF1","MED12","FUBP1","PTEN")
cdot <- c("c.590C>T","c.8663-5delT","c.1621A>G","c.*2C>T","c.1202_1204delAGA","c.11536G>A","c.4587C>G","c.937T>C","c.11756_11758delAGC","c.380G>A","c.6226_6228delCAG","c.121-4_121-3delTT","c.397G>A")
pdot <- c("T197I","","T541A","","K401del","G3846S","D1529E","S313P","Q3919del","G127E","Q2076del","","V133I")
patient3 <- data.frame(chromosome, position, reference, alter, gene, cdot, pdot)
To keep things simple, I made the following example:
#Simple Example
modules1 <- c("maths", "physics", "geometry", "languages", "science", "geology")
scores1 <- c("A+", "A", "A", "B+", "B", "B")
student1 <- data.frame(modules1, scores1)
modules2 <- c("music", "dance", "languages", "science")
scores2 <- c("A+", "A+", "A+", "B")
student2 <- data.frame(modules2, scores2)
modules3 <- c("languages", "science", "physics", "maths")
scores3 <- c("A+", "A+", "A+", "A")
student3 <- data.frame(modules3, scores3)
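How can I cluster students 1, 2 and 3 based on their scores? I would expect students 1 and 3 to sit closer together in the dendrogram than either does to student 2.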
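Answer 0 (score: 0)
I would suggest encoding the data in a numeric format, probably one-hot encoding, since this is categorical data.
I would also encode genes and mutations separately, because different mutations in the same gene may be equivalent.
For the following genes and mutations: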
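To illustrate why separating the two levels matters, here is a rough sketch that uses the student example from the question as a stand-in: modules play the role of genes, and module-plus-score pairs play the role of specific mutations. The Jaccard distance and the helper names `jaccard_dist` and `pairwise` are assumptions chosen for illustration, not something the question prescribes.

```r
# Student data from the question, stored as named score vectors.
students <- list(
  student1 = c(maths = "A+", physics = "A", geometry = "A",
               languages = "B+", science = "B", geology = "B"),
  student2 = c(music = "A+", dance = "A+", languages = "A+", science = "B"),
  student3 = c(languages = "A+", science = "A+", physics = "A+", maths = "A")
)

# Jaccard distance between two label sets: 1 - |intersection| / |union|.
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical helper: pairwise Jaccard distances for a list of label sets.
pairwise <- function(sets) {
  n <- length(sets)
  d <- matrix(0, n, n, dimnames = list(names(sets), names(sets)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- jaccard_dist(sets[[i]], sets[[j]])
    }
  }
  as.dist(d)
}

# "Gene-level" analogue: compare modules only.
d_modules <- pairwise(lapply(students, names))
# "Mutation-level" analogue: compare module:score pairs.
d_pairs <- pairwise(lapply(students, function(s) paste(names(s), s, sep = ":")))

# Average-linkage hierarchical clustering on the module-level distances.
hc <- hclust(d_modules, method = "average")
print(d_modules)
print(d_pairs)
# plot(hc) draws the dendrogram
```

At the module level, students 1 and 3 share four of six distinct modules and are merged first. At the module:score level they share no exact pair, so the same two students end up maximally far apart. Which level is appropriate depends on whether different alterations of one gene should count as the same event.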
list_genes = ["gene1", "gene2", "gene3"]
list_disease = ["disease1", "disease2"]
list_mutations_patient1 = ["c25g", "g149e", "t543k"]
list_mutations_patient2 = ["a50g", "", "t543k"]
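The first position in each inner list is true/false for any mutation in that gene, the following positions are true/false for each mutation identified in the dataset, and the last inner list is the disease status: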
coded_list_gene_mutation_patient1 = [[1,1,0],[1,1],[1,1],[1,0]]
coded_list_gene_mutation_patient2 = [[1,0,1],[0,0],[1,1],[0,1]]
Flatten each patient's lists and collect the data for all patients:
all_patient_lists = [[1,1,0,1,1,1,1,1,0],[1,0,1,0,0,1,1,0,1]]
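Since the question's data is in R, the encoding idea above might be sketched there as follows. This is an illustration under assumptions: the long-format data frame `alts` (one row per alteration, with a `patient` column) and the `GENE_pdot` column naming are made up for the example, and the disease-status columns are left out.

```r
# Toy long-format data: one row per alteration
# (a hypothetical subset of the question's three patients).
alts <- data.frame(
  patient = c("patient1", "patient1", "patient1",
              "patient2", "patient2", "patient3", "patient3"),
  gene    = c("PTEN", "PTEN", "MED12", "PTEN", "MED12", "PTEN", "KMT2D"),
  pdot    = c("C136Y", "V133I", "Q2076del", "V133I", "Q2076del", "V133I", "G3846S"),
  stringsAsFactors = FALSE
)

# Gene-level indicators: TRUE if the patient has any alteration in the gene.
gene_mat <- unclass(table(alts$patient, alts$gene)) > 0

# Mutation-level indicators: TRUE if the patient carries that exact alteration.
mut_label <- paste(alts$gene, alts$pdot, sep = "_")
mut_mat <- unclass(table(alts$patient, mut_label)) > 0

# One numeric patient x feature matrix (rows: patients, columns: features).
X <- cbind(gene_mat, mut_mat) * 1
print(X)
```

Keeping the gene-level and mutation-level columns side by side lets a later step weight or drop either group without re-encoding the data.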
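Because the lists can get very long, you should consider dimensionality reduction (PCA, LDA or MDS). You can then plot the first 2 or 3 components to see how well they separate the data, and pass those PCA components to a true clustering algorithm (rather than a partitioning algorithm), such as hierarchical density-based clustering (HDBSCAN).
HDBSCAN assigns a sample to a cluster only when enough samples are present to meet the minimum cluster size. This works well if you expect some noise in the data, since noisy samples are labelled as outliers rather than being forced into a cluster.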
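A rough base-R sketch of the reduce-then-cluster idea follows. Note the substitution: it uses `hclust` in place of HDBSCAN so that it runs without extra packages (the `dbscan` package provides an `hdbscan()` function, which is not used here). The 4 x 6 binary matrix and the choice of two components and two clusters are assumptions for illustration.

```r
# Toy patient x feature matrix (1 = alteration present); values are made up.
X <- rbind(
  p1 = c(1, 1, 1, 0, 0, 0),
  p2 = c(1, 1, 0, 0, 0, 0),
  p3 = c(0, 0, 0, 1, 1, 1),
  p4 = c(0, 0, 0, 0, 1, 1)
)
colnames(X) <- paste0("feature", 1:6)

# PCA on the centred indicators; keep the first two components.
pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# Hierarchical clustering on the reduced coordinates
# (stand-in for HDBSCAN, which would also flag outliers).
hc <- hclust(dist(scores), method = "average")
clusters <- cutree(hc, k = 2)
print(clusters)
# plot(scores) shows the patients on the first two components
```

Unlike HDBSCAN, `cutree` forces every patient into one of `k` clusters, so noisy samples are not set aside as outliers in this simplified version.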
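Answer 1 (score: -1)
I think you need stringdist().
The stringdist package provides "approximate string matching and string distance functions". It offers many algorithms, but the one used in this example is the Jaro–Winkler distance (Winkler, 1990), which measures how similar two strings are. The higher the Jaro–Winkler similarity of two strings, the more alike they are. The metric was designed for, and is best suited to, short strings such as person names. The similarity score is normalized so that 0 means no similarity and 1 means an exact match; stringdist returns the corresponding distance (1 minus the similarity), so there 0 is an exact match and 1 is no similarity.
The data for the 100 patients can be combined into a single data frame. Borrowing from the code above, I extend it as follows: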
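## Let's merge all data frames into one
# merge() joins on all shared columns, so an alteration found in several
# patients ends up as a single row in the merged data frame
df.1 <- merge(patient1, patient2, all = TRUE)
df.master <- merge(df.1, patient3, all = TRUE)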
# replace missing entries with 0
df.master[is.na(df.master)] <- 0
head(df.master, 5)
  chromosome  position reference alter   gene                 cdot   pdot
1          1  11206853        AC         MTOR c.4571-6_4571-5delGT
2          1 242023898         A     G   EXO1             c.836A>G  N279S
3          1  78429978         T        FUBP1            c.901delA I301fs
4          1  78435702        AA        FUBP1   c.121-4_121-3delTT
5          1 120539781         G     A NOTCH2             c.590C>T  T197I
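Now let's measure the distances between the strings. You mentioned that your interest lies in the gene and pdot variables, so I use them as follows (the code below works on the pdot strings):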
library(stringdist)
# find the unique pdot values (the names below say "genes", but these are pdot strings)
uniquegenes <- unique(as.character(df.master$pdot))
# determine the distance between the strings using the Jaro-Winkler distance
distancemodels <- stringdistmatrix(uniquegenes,uniquegenes,method = "jw")
rownames(distancemodels) <- uniquegenes
# Perform hierarchical clustering
hc <- hclust(as.dist(distancemodels))
# show the plot
plot(hc)
# look at clusters
dfClust <- data.frame(uniquegenes, cutree(hc, k=4))
names(dfClust) <- c('gene_name','cluster')
print(paste('Average number of genes per cluster:', mean(table(dfClust$cluster))))
[1] "Average number of genes per cluster: 5.75"
# the average number of entries per cluster is 5.75. Let's look at these entries
# (the table is stored as `tab` to avoid masking base::t)
tab <- table(dfClust$cluster)
tab <- cbind(tab, tab/length(dfClust$cluster))
tab <- tab[order(tab[,2], decreasing=TRUE),]
p <- data.frame(factorName=rownames(tab), binCount=tab[,1], percentFound=tab[,2])
dfClust <- merge(x=dfClust, y=p, by.x = 'cluster', by.y='factorName', all.x=T)
# sort so that members of the largest cluster come first
dfClust <- dfClust[rev(order(dfClust$binCount)),]
# after the merge the columns are already named cluster, gene_name, binCount, percentFound
head(dfClust[c('cluster','gene_name')],5)
cluster gene_name
12 1 S313P
11 1 M159fs
10 1 G127E
9 1 D25G
8 1 K139fs
Clearly, cluster 1 is the largest cluster, with the most gene_name entries. Hope this helps.