Question

I have some calculations that are written to a file and read into a data frame, laid out as follows:

sequence_1  sequence_2  identity
CP010953    CP010953    100
CP010953    CP012689    73.9
CP010953    CP000025    73.86
CP010953    CP012149    73.77
CP010953    HE978252    73.72999999999999
CP010953    CP009043    83.35000000000001

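The data come from a calculation (in Python) that counts the number of matching characters between two strings and divides it by the length of one of the strings (both strings have the same length). It seemed like a good idea at the time, but I used itertools.combinations_with_replacement to make the calculation faster. So if I compare three strings (a, b, c), it only compares a & b, a & c, and b & c, and not b & a, c & a, and c & b, since those have the same values as a & b, a & c, and b & c respectively. The problem is that when I read the data into R and plot a heatmap, I end up with this: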

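It has a bunch of gaps (you can probably see that the values I need are all there; for example, AL111168 and CP000538, both in the lower left, have values along the y-axis but not along the x-axis)!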
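
To make the cause of the gaps concrete, here is a minimal Python sketch of the kind of calculation described above. The original script is not shown, so the function name `identity`, the toy sequences, and the variable names are all assumptions made for illustration. It shows that combinations_with_replacement yields each unordered pair only once, and that mirroring the off-diagonal rows would supply the missing orientations.

```python
from itertools import combinations_with_replacement


def identity(a, b):
    """Percentage of positions at which two equal-length strings match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)


# Hypothetical toy sequences; the real input is not shown in the question.
sequences = {"a": "ACGT", "b": "ACGA", "c": "TCGA"}

# Each unordered pair appears once, self-pairs included:
# (a, a), (a, b), (a, c), (b, b), (b, c), (c, c)
rows = [
    (n1, n2, identity(sequences[n1], sequences[n2]))
    for n1, n2 in combinations_with_replacement(sequences, 2)
]

# Add the reverse orientation of every off-diagonal pair so that both
# (a, b) and (b, a) exist; self-pairs are not duplicated.
mirrored = rows + [(n2, n1, pct) for n1, n2, pct in rows if n1 != n2]

for row in mirrored:
    print(*row, sep="\t")
```

With three sequences this produces six rows before mirroring and nine afterwards, which is the full 3 x 3 grid a heatmap needs.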

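Is there a way to fill in these gaps with the appropriate values in R? I could do it in a loop, but that's not very R-esque. I'm sure this has been asked before, but I don't think I'm using the right search terms.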

Here is some of my code:

library(ggplot2)  # provides qplot() and the layers used below

# Command-line arguments: tab-separated input file and gene name
args <- commandArgs(trailingOnly=TRUE)

file_name <- args[1]
gene_name <- args[2]

image_name <- paste(gene_name, '.png', sep='')

myDF <- read.csv(file_name, header=TRUE, sep='\t')

my_palette <- colorRampPalette(c('red', 'yellow', 'green'))

png(filename=image_name, width=3750, height=2750, res=300)
par(mar=c(9.5,4.3,4,2))
# One tile per (sequence_1, sequence_2) row, filled by identity
print(corpus <- qplot(x=sequence_1, y=sequence_2, data=myDF, fill=identity, geom='tile') +
                    geom_text(aes(label=identity), color='black', size=3) +
                    scale_fill_gradient(limits=c(0, 100), low='gold', high='green4') +
                    labs(title='Campylobacter Pair-wise Sequence Identity Comparison', x=NULL, y=NULL) +
                    guides(fill = guide_legend(title = 'Sequence\nSimilarity %', title.theme = element_text(size=15, angle = 0))) + theme(legend.text=element_text(size=12)) +
                    theme(axis.text.x=element_text(angle=45, size=14, hjust=1, colour='black'), axis.text.y=element_text(size=14, hjust=1, colour='black')) )
dev.off()

Thanks in advance.

Answer 1

I found a way.

mDF <- myDF
colnames(mDF)[1] <- 'sequence_2'
colnames(mDF)[2] <- 'sequence_1'
newDF <- rbind(mDF, myDF)

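Then plot newDF. Because rbind matches data frame columns by name, the copy with the swapped column names contributes every pair in the reverse orientation, which fills the gaps.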

R: Filling in a data frame to create a symmetric identity plot