Is there a way to cluster multiple sets of strings by similarity?

Time: 2019-05-07 20:28:05

Tags: cluster-analysis dna-sequence mutation

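I have next-generation sequencing data for several patients (Patient 1, Patient 2, Patient 3, ...).

The patient samples can come from the same disease or from different diseases. We know that certain mutations occur more frequently in certain diseases. Some variants are known to cause disease, while others are associated with a disease without our really knowing how they contribute to it. I am looking for a way to cluster these patients by their altered genes, to see whether they share any common features. A gene can carry several different alterations (e.g. NRAS G12D vs. NRAS G13D vs. NRAS Q61K ...). The order of the gene alterations within a given patient does not matter. A typical patient has about 500 alterations, and there are about 100 patients.

I checked previous posts, but those questions are about clustering the strings that make up a single list, not about clustering across multiple lists of strings.

Thanks for your help.

The data for one patient look like this: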

    #Patient1
    chromosome <- c("X",    "7",    "10",   "1",    "X",    "5",    "5",    "X",    "10",   "7")
    position <- c("70360589","128829066","89692923","11206853","70360680","176637576","176637471","70360648","89692913","148543694")
    reference <- c("AGC","A","G","AC","GCA","T","G","CAG","G","AA")
    alter <- c("","G","A","","","C","A","","A","")
    gene <- c("MED12","SMO","PTEN","MTOR","MED12","NSD1","NSD1","MED12","PTEN","EZH2")
    cdot <- c("c.6165_6167delGCA","c.74A>G","c.407G>A","c.4571-6_4571-5delGT","c.6256_6258delCAG","c.2176T>C","c.2071G>A","c.6226_6228delCAG","c.397G>A","c.118-5_118-4delTT")
    pdot <- c("Q2076del","D25G","C136Y"," ","Q2086del","S726P","A691T","Q2076del","V133I"," ")
    patient1 <- data.frame(chromosome, position, reference, alter, gene, cdot, pdot)

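A mutation can be represented in different ways: gene & cdot, gene & gdot, chromosome with ref and alter, and so on. For me the most convenient is gene & pdot, because it is the most informative: it tells me which gene is altered and what the alteration is (e.g. in PTEN C25G, PTEN is the gene and C25G means that the reference amino acid "C" at position 25 is changed to amino acid "G").

I tried concatenating each gene & pdot pair into a single string, so a patient with 10 alterations, as in the data frame above, gives 10 strings. I would do this for all patients and then cluster all patients based on their alterations. My question is: what is the best way to cluster multiple patients in this setting?

Two more patients: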

    #Patient2
    chromosome <- c("X","6","1","1","6","12","5","X","1","10")
    position <- c("47424495","157100024","78429978","242023898","30858801","49427266","176637576","70360648","78435702","89692913")
    reference <- c("A","GGA","T","A","C","TGC","T","CAG","AA","G")
    alter <- c("","","","G","","","C","","","A")
    gene <- c("ARAF","ARID1B","FUBP1","EXO1","DDR1","KMT2D","NSD1","MED12","FUBP1","PTEN")
    cdot <- c("c.416delA","c.983_985delGAG","c.901delA","c.836A>G","c.474delC","c.11220_11222delGCA","c.2176T>C","c.6226_6228delCAG","c.121-4_121-3delTT","c.397G>A")
    pdot <- c("K139fs","G328del","I301fs","N279S","M159fs","Q3745del","S726P","Q2076del","","V133I")
    patient2 <- data.frame(chromosome, position,  reference, alter, gene, cdot, pdot)


    #Patient3
    chromosome <- c("1","2","11","14","14","12","2","19","12","17","X","1","10")
    position <- c("120539781","141259448","64572018","35871217","102551161","49426952","29416366","18273047","49426730","29490295","70360648","78435702","89692913")
    reference <- c("G","A","T","G","TCT","C","G","T","GCT","G","CAG","AA","G")
    alter <- c("A","","C","A","","T","C","C","","A","","","A")
    gene <- c("NOTCH2","LRP1B","MEN1","NFKBIA","HSP90AA1","KMT2D","ALK","PIK3R2","KMT2D","NF1","MED12","FUBP1","PTEN")
    cdot <- c("c.590C>T","c.8663-5delT","c.1621A>G","c.*2C>T","c.1202_1204delAGA","c.11536G>A","c.4587C>G","c.937T>C","c.11756_11758delAGC","c.380G>A","c.6226_6228delCAG","c.121-4_121-3delTT","c.397G>A")
    pdot <- c("T197I","","T541A","","K401del","G3846S","D1529E","S313P","Q3919del","G127E","Q2076del","","V133I")
    patient3 <- data.frame(chromosome, position,  reference, alter, gene, cdot, pdot)

To make things simpler, I made the following example:

    #Simple Example
    modules1 <- c("maths", "physics", "geometry", "languages", "science", "geology")
    scores1 <- c("A+", "A", "A", "B+", "B", "B")
    student1 <- data.frame(modules1, scores1)
    modules2 <- c("music", "dance", "languages", "science")
    scores2 <- c("A+", "A+", "A+", "B")
    student2 <- data.frame(modules2, scores2)
    modules3 <- c("languages", "science", "physics", "maths")
    scores3 <- c("A+", "A+", "A+", "A")
    student3 <- data.frame(modules3, scores3)

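How can I cluster students 1, 2 and 3 based on their scores? I would expect students 1 and 3 to sit closer to each other in the dendrogram than either does to student 2.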

2 answers:

Answer 0: (score: 0)

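I suggest encoding the data in a numeric format, probably with one-hot encoding, since these are categorical data.

I would also encode genes and mutations separately, because different mutations in the same gene may be equivalent.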
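The simplified student example from the question shows why the level of encoding matters, if a module is read as the analogue of a gene and a score as the analogue of a specific mutation. The following is only a minimal base R sketch: it treats each student as a set of items and compares students with the Jaccard distance computed by `dist(method = "binary")`. The helper name `jaccard_dist` is made up for this illustration and does not come from any package.

```r
# Student data from the question, as named character vectors (module -> score)
students <- list(
  student1 = c(maths = "A+", physics = "A", geometry = "A",
               languages = "B+", science = "B", geology = "B"),
  student2 = c(music = "A+", dance = "A+", languages = "A+", science = "B"),
  student3 = c(languages = "A+", science = "A+", physics = "A+", maths = "A")
)

# Hypothetical helper: one-hot encode a list of item sets and return the
# pairwise Jaccard distances (1 - shared items / items in either set)
jaccard_dist <- function(item_sets) {
  items <- sort(unique(unlist(item_sets)))
  onehot <- t(sapply(item_sets, function(s) as.integer(items %in% s)))
  dimnames(onehot) <- list(names(item_sets), items)
  dist(onehot, method = "binary")
}

# Coarse level: modules only (analogous to "gene altered: yes/no")
d_module <- jaccard_dist(lapply(students, names))

# Fine level: module and score together (analogous to gene & pdot)
d_pair <- jaccard_dist(lapply(students, function(s) paste(names(s), s)))

print(round(as.matrix(d_module), 2))
print(round(as.matrix(d_pair), 2))

# Dendrogram at the coarse level
hc_module <- hclust(d_module, method = "average")
if (interactive()) plot(hc_module)
```

At the module level, students 1 and 3 are the closest pair (distance 1/3), which matches the expectation stated in the question. With exact module-and-score pairs they share no item at all and end up at the maximum distance of 1, so the choice of encoding level can change the dendrogram.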

For the following genes and mutations:

    list_genes = ["gene1", "gene2", "gene3"]
    list_disease = ["disease1", "disease2"]
    list_mutations_patient1 = ["c25g", "g149e", "t543k"]
    list_mutations_patient2 = ["a50g", "", "t543k"]

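Within each per-gene list, the first position is a true/false flag for any mutation in that gene, and the following positions are true/false flags for each mutation identified in the data set. The last list for each patient is the disease status:

    coded_list_gene_mutation_patient1 = [[1,1,0],[1,1],[1,1],[1,0]]
    coded_list_gene_mutation_patient2 = [[1,0,1],[0,0],[1,1],[0,1]]

Flatten the lists and stack the data for all patients:

    all_patient_lists = [[1,1,0,1,1,1,1,1,0],[1,0,1,0,0,1,1,0,1]]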
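For the three example patients in the question, this kind of encoding might look as follows in R. This is a minimal sketch under stated assumptions: the `patients` list is typed in by hand from the gene and pdot columns of the question, rows with an empty pdot are dropped and duplicate pairs are collapsed, and disease status is left out because the question does not give it. The helper `one_hot` is a made-up name for this illustration.

```r
# gene & pdot pairs copied from the question's data frames
# (assumption: rows with an empty pdot are dropped, duplicates collapsed)
patients <- list(
  patient1 = c("MED12 Q2076del", "SMO D25G", "PTEN C136Y", "MED12 Q2086del",
               "NSD1 S726P", "NSD1 A691T", "PTEN V133I"),
  patient2 = c("ARAF K139fs", "ARID1B G328del", "FUBP1 I301fs", "EXO1 N279S",
               "DDR1 M159fs", "KMT2D Q3745del", "NSD1 S726P",
               "MED12 Q2076del", "PTEN V133I"),
  patient3 = c("NOTCH2 T197I", "MEN1 T541A", "HSP90AA1 K401del",
               "KMT2D G3846S", "ALK D1529E", "PIK3R2 S313P",
               "KMT2D Q3919del", "NF1 G127E", "MED12 Q2076del", "PTEN V133I")
)

# Hypothetical helper: binary matrix with one row per patient and
# one column per distinct feature (1 = present, 0 = absent)
one_hot <- function(sets) {
  feats <- sort(unique(unlist(sets)))
  m <- t(sapply(sets, function(s) as.integer(feats %in% s)))
  dimnames(m) <- list(names(sets), feats)
  m
}

# Gene-level flags: any alteration in the gene
genes <- lapply(patients, function(p) unique(sub(" .*$", "", p)))
gene_matrix <- one_hot(genes)

# Alteration-level flags: the exact gene & pdot pair
alteration_matrix <- one_hot(patients)

# One row per patient: gene flags followed by alteration flags
encoded <- cbind(gene_matrix, alteration_matrix)

# Jaccard distance between patients on the alteration flags, then clustering
d <- dist(alteration_matrix, method = "binary")
hc <- hclust(d, method = "average")
if (interactive()) plot(hc)
```

Dropping rows without a pdot loses the intronic variants; if those matter, the cdot string could be used as the feature label for such rows instead.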

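Because the lists can become very long, you should consider dimensionality reduction (PCA, LDA or MDS). You can then plot the first 2 or 3 components to see how well they partition the data, and pass those components to a true clustering algorithm (rather than a partitioning algorithm), such as hierarchical density-based clustering (HDBSCAN).

This assigns each sample to a cluster, given a minimum number of members required to form a cluster. It works well if you expect some noise in the data, because noise points are labeled as outliers rather than being forced into a cluster.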
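A rough base R sketch of the reduce-then-cluster workflow is shown here. Everything in it is an assumption made for illustration: the binary matrix is simulated (two planted patient groups, not real data), and Ward hierarchical clustering is used as a stand-in for HDBSCAN so that the example needs no extra packages. An HDBSCAN implementation, for instance the one in the dbscan package, could be substituted in the last step.

```r
set.seed(1)  # reproducible simulated data, not real patient data

n_per_group <- 50
n_features <- 40

# Simulated alteration probabilities: group A is enriched for features 1-10,
# group B for features 11-20, all other features are rare background
prob <- matrix(0.05, nrow = 2 * n_per_group, ncol = n_features)
prob[1:n_per_group, 1:10] <- 0.8
prob[(n_per_group + 1):(2 * n_per_group), 11:20] <- 0.8
x <- matrix(rbinom(length(prob), size = 1, prob = prob), nrow = nrow(prob))
rownames(x) <- paste0("patient", seq_len(nrow(x)))

# PCA on the binary matrix; keep the first three components
pca <- prcomp(x, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:3]

# Classical MDS on Jaccard distances as an alternative 2-D embedding
mds <- cmdscale(dist(x, method = "binary"), k = 2)

# Stand-in clustering step: Ward hierarchical clustering on the PCA scores
hc <- hclust(dist(scores), method = "ward.D2")
clusters <- cutree(hc, k = 2)
print(table(clusters))

# Visual check of how the first two components partition the patients
if (interactive()) plot(scores[, 1:2], col = clusters, pch = 19)
```

Unlike HDBSCAN, `cutree()` assigns every patient to a cluster and needs the number of clusters up front, so it does not reproduce the outlier handling described above.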

Answer 1: (score: -1)

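I think you need stringdist(). The stringdist package provides "approximate string matching and string distance functions". It offers many algorithms; the one used in this example is the Jaro–Winkler distance (Winkler, 1990), which measures how similar two strings are. The Jaro–Winkler metric was designed for, and is best suited to, short strings such as person names. As a similarity score it is normalized so that 0 means no similarity and 1 means an exact match. stringdist reports the corresponding distance (1 minus the similarity), so 0 means identical strings and larger values mean less similar strings.

The data for the 100 patients can be merged into one data frame. Borrowing from the code above, I extended it as follows: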

    ## Let's merge all data frames into one
    df.1 <- merge(patient1, patient2, all = TRUE)
    df.master <- merge(df.1, patient3, all = TRUE)
    # replace missing entries with 0
    df.master[is.na(df.master)] <- 0
    head(df.master, 5)
      chromosome  position reference alter   gene                 cdot   pdot
    1          1  11206853        AC         MTOR c.4571-6_4571-5delGT       
    2          1 242023898         A     G   EXO1             c.836A>G  N279S
    3          1  78429978         T        FUBP1            c.901delA I301fs
    4          1  78435702        AA        FUBP1   c.121-4_121-3delTT       
    5          1 120539781         G     A NOTCH2             c.590C>T  T197I
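Note that `merge()` without a `by` argument joins on all shared columns, so a variant found in several patients collapses into a single row and the merged data frame does not record which patient it came from. If patient identity is needed later (for clustering patients rather than strings), one option is to stack the data frames with an explicit patient column. The sketch below uses small hand-picked excerpts of the question's data, and the names `patient_list`, `stacked` and `incidence` are made up for this illustration.

```r
# Small excerpts of the question's data frames (gene and pdot columns only)
patient1 <- data.frame(gene = c("MED12", "SMO", "PTEN"),
                       pdot = c("Q2076del", "D25G", "V133I"))
patient2 <- data.frame(gene = c("ARAF", "MED12", "PTEN"),
                       pdot = c("K139fs", "Q2076del", "V133I"))
patient3 <- data.frame(gene = c("NOTCH2", "MED12", "PTEN"),
                       pdot = c("T197I", "Q2076del", "V133I"))

patient_list <- list(patient1 = patient1, patient2 = patient2,
                     patient3 = patient3)

# Stack the data frames and record the source patient in a new column
stacked <- do.call(rbind, Map(function(df, id) data.frame(patient = id, df),
                              patient_list, names(patient_list)))
rownames(stacked) <- NULL
stacked$alteration <- paste(stacked$gene, stacked$pdot)

# Patient-by-alteration incidence matrix (1 = present, 0 = absent)
incidence <- unclass(table(stacked$patient, stacked$alteration))
incidence[incidence > 1] <- 1L
print(incidence)
```

A matrix like `incidence` has one row per patient, so it can be passed to `dist()` and `hclust()` to cluster patients directly.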

Now let's measure the distances between the strings. You mentioned that your interest is in the gene and pdot variables, so I use them as follows:

    library(stringdist)

    # find the unique pdot strings (stored under the name `uniquegenes`)
    uniquegenes <- unique(as.character(df.master$pdot))
    # pairwise string distances with method "jw"; with the default p = 0 this is
    # the plain Jaro distance, set p (e.g. p = 0.1) for the Winkler prefix adjustment
    distancemodels <- stringdistmatrix(uniquegenes, uniquegenes, method = "jw")
    rownames(distancemodels) <- uniquegenes
    # perform hierarchical clustering
    hc <- hclust(as.dist(distancemodels))
    # show the dendrogram
    plot(hc)

(Image: dendrogram produced by plot(hc).)

    # look at clusters
    dfClust <- data.frame(uniquegenes, cutree(hc, k = 4))
    names(dfClust) <- c('gene_name', 'cluster')
    print(paste('Average number of genes per cluster:', mean(table(dfClust$cluster))))
    [1] "Average number of genes per cluster: 5.75"

    # the average number of genes per cluster is 5.75. Let's look at these genes
    t <- table(dfClust$cluster)
    # add each cluster's share of all strings, then sort clusters by that share
    t <- cbind(t, t / length(dfClust$cluster))
    t <- t[order(t[, 2], decreasing = TRUE), ]
    p <- data.frame(factorName = rownames(t), binCount = t[, 1], percentFound = t[, 2])
    # attach cluster size and share to every string, largest clusters first
    dfClust <- merge(x = dfClust, y = p, by.x = 'cluster', by.y = 'factorName', all.x = TRUE)
    dfClust <- dfClust[rev(order(dfClust$binCount)), ]
    head(dfClust[c('cluster', 'gene_name')], 5)
       cluster gene_name
    12       1     S313P
    11       1    M159fs
    10       1     G127E
    9        1      D25G
    8        1    K139fs

Clearly, cluster 1 is the largest cluster, the one containing the most genes. Hope this helps.