How to generate a filled heatmap from a sparse matrix by filling in missing values

Time: 2019-09-18 16:38:22

Tags: r sparse-matrix heatmap missing-data

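The data frame has three columns: id, Days, and sum. I want to generate a heatmap of sum, with id on the y-axis and Days on the x-axis. The problem is that the data are sparse, so the heatmap consists of discrete bars. I would like each bar to extend to the right so that the bars become solid bands: the color should change when sum changes value, and hold until the next day to the right that has a value.

Here is an example that generates the type of plot I am making.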

library(ggplot2)

set.seed(13)
x_id <- sample( LETTERS[1:5], 100, replace=TRUE, 
                prob=c(0.15, 0.2, 0.35, 0.1, 0.2) )
x_sum <- sample( c(5, 30, 60, 120, 180, 240, 360), 100, replace=TRUE, 
                   prob=c(.1, .1, .2, .2, .2, .1, .1) )
x_days <- sample.int(2000, 100, replace = TRUE)-1000

df <- data.frame(id = x_id, Days = x_days, sum = x_sum)

ggp <- ggplot(data = df, 
       mapping = aes(x = Days, 
                     y = id, 
                     fill = sum)) +
  geom_tile() +
  xlab(label = "Days") + ylab(label = 'id') +
  scale_fill_gradient(low = "blue", high = "red") 
print(ggp)

Example of sparse heatmap

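I want the colors to extend to the right. I think this means the data frame should be sorted by id and Days, and additional rows must be added for each id so that the missing days are filled in with that id and its last observed value of sum. But how do I add rows for each id and fill in the missing values? The rightmost color should extend a fixed length, for example 30 days, so that it is more visible.

In addition, the color map should indicate a critical value. Suppose the critical value is 180. Then for sums from zero up to the critical value, the color should go from green (0) to yellow (179), and for sums at or above the critical value, the color should go from light red (180) to dark red (the maximum, 360).

Update:

Here is a solution that fills in the sparse matrix.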

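library(data.table)
library(tidyr)

DT <- as.data.table(df)  # df is the data frame built in the example above

# Join onto every day from each id's first day to 30 days past its last;
# days without an observation get sum = NA
setkey(DT, id, Days)
DT_fill_NA <- DT[setkey(DT[, .(min(Days):(max(Days)+30)), by = id], id, V1)]

# Carry the last observed sum forward into the NA rows
DT_fill <- fill(DT_fill_NA, c('sum'), .direction = "down")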

ggp <- ggplot(data = DT_fill, 
              mapping = aes(x = Days, 
                            y = id, 
                            fill = sum)) +
  geom_tile() +
  xlab(label = "Days") + ylab(label = 'id') +
  scale_fill_gradient(low = "blue", high = "red") 
print(ggp)

This creates a plot in which each sparse bar extends to the right until the next bar.

Sparse Heatmap Filled to the Right
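
To make the join-and-fill step easier to follow, here is a minimal sketch on a toy table. It is an alternative formulation under assumptions, not the code used above: the toy values and the `extend` length are hypothetical, and it uses `merge()` plus `data.table::nafill()` (available in data.table 1.12.4 and later) grouped by id instead of `tidyr::fill()`.

```r
library(data.table)

# Toy data (hypothetical values): two ids with gaps between observed days
toy <- data.table(id   = c("A", "A", "B"),
                  Days = c(1L, 4L, 2L),
                  sum  = c(30, 180, 60))
extend <- 2L  # days to extend past the last observation (assumed value)

# One row per id and day, from the first observed day to last day + extend
grid <- toy[, .(Days = seq(min(Days), max(Days) + extend)), by = id]

# Left-join the observations onto the grid; unobserved days get sum = NA
toy_fill <- merge(grid, toy, by = c("id", "Days"), all.x = TRUE)

# Carry the last observed value forward within each id
toy_fill[, sum := nafill(sum, type = "locf"), by = id]
print(toy_fill)
```

Grouping the fill by id keeps one id's values from ever leaking into the next id's rows, which matters if an id could start with a missing value.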

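The color map should now be modified to indicate the critical value. Let the critical value be 180. Then for sums from zero up to the critical value, the color should go from green (0) to yellow (179), and for sums at or above the critical value, the color should go from light red (180) to dark red (the maximum, 360).

Second update

One way to get green colors with a break at 180 is as follows.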

ggp <- ggplot(data = DT_fill, 
              mapping = aes(x = Days, 
                            y = id, 
                            fill = sum)) +
  geom_tile() +
  xlab(label = "Days") + ylab(label = 'id') +
  scale_fill_gradient2(low = "green", mid = "indianred2", high = "red2", 
                         midpoint = 180, breaks = c(50, 100, 200, 300)) +
  theme_bw()

print(ggp)

Sparse data extended right highlighting break point

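I am not sure this clearly identifies the break point as a specific value. How can the transition between green and red be set exactly at the critical value (180)?

1 Answer:

Answer 0: (score: 0)

Here is a way to produce a filled heatmap from a sparse matrix that highlights the critical value.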

library(ggplot2)
library(data.table)
library(tidyr)

set.seed(13)
n_rows = 200
x_id <- sample( LETTERS[1:5], n_rows, replace=TRUE, 
                prob=c(0.15, 0.2, 0.35, 0.1, 0.2) )
x_sum <- sample(        c(0,  5, 30, 60, 120, 180, 240, 270, 360), n_rows, replace=TRUE, 
                 prob=c(.05, .05, .1, .2, .2,  .2,  .1, .05, .05) )
x_days <- sample.int(2000, n_rows, replace = TRUE)-1000

DT <- data.table(id = x_id, Days = x_days, sum = x_sum)

setkey(DT, id, Days)
DT_fill_NA <- DT[setkey(DT[, .(min(Days):(max(Days)+100)), by = id], id, V1)]

DT_fill <- fill(DT_fill_NA, c('sum'), .direction = "down")


brks = c(-1, 50, 100, 180, 250, 300, max(DT_fill$sum))
DT_fill$sum_factors = cut(DT_fill$sum, breaks = brks, ordered_result = TRUE, right = TRUE)
unique(DT_fill$sum_factors)

ggp <- ggplot(data = DT_fill, 
              mapping = aes(x = Days, 
                            y = id, 
                            fill = sum_factors)) +
  geom_tile() +
  xlab(label = "Days") + ylab(label = 'id') +
  scale_fill_manual(values = c("green4", "green3", "green", 
                               "firebrick1", "firebrick3", "firebrick4")) +
  theme_bw()

print(ggp)
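
One detail worth checking in the binning above: with `right = TRUE` the intervals are closed on the right, so a sum of exactly 180 falls into `(100,180]`, the last green bin. If the critical value itself should be shown in red, as the question describes, left-closed intervals are one option. The sketch below compares the two on a few hypothetical values; the shifted outer breaks (0 and 361) are assumptions chosen so that 0 and 360 stay inside the range.

```r
vals <- c(0, 179, 180, 360)  # hypothetical sums around the critical value

# Right-closed intervals, as in the code above: 180 lands in (100,180]
right_closed <- cut(vals, breaks = c(-1, 50, 100, 180, 250, 300, 360),
                    ordered_result = TRUE, right = TRUE)

# Left-closed intervals: 180 lands in [180,250)
left_closed <- cut(vals, breaks = c(0, 50, 100, 180, 250, 300, 361),
                   ordered_result = TRUE, right = FALSE)

print(data.frame(vals, right_closed, left_closed))
```

With the left-closed version, the same six colors in `scale_fill_manual()` still line up, because there are still three bins below 180 and three bins from 180 upward.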

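
If a continuous color scale is preferred over bins, one possible approach is `scale_fill_gradientn()` with explicit color positions, placing yellow just below 180 and light red at 180 so that the transition is effectively a hard break. This is a sketch rather than the method used above; the toy data, the color names, and the small offset `eps` are assumptions for illustration.

```r
library(ggplot2)
library(scales)

critical <- 180   # critical value from the question
max_sum  <- 360   # largest sum in the example data
eps      <- 1e-6  # assumed tiny offset so yellow ends just below the break

# Hypothetical toy data covering both sides of the critical value
toy <- data.frame(id   = rep(c("A", "B"), each = 4),
                  Days = rep(1:4, times = 2),
                  sum  = c(0, 90, 179, 180, 240, 360, 30, 200))

# Positions of the four colors on the 0..1 scale that gradientn expects
pos <- rescale(c(0, critical - eps, critical, max_sum), from = c(0, max_sum))

ggp_cont <- ggplot(toy, aes(x = Days, y = id, fill = sum)) +
  geom_tile() +
  scale_fill_gradientn(colours = c("green", "yellow", "lightcoral", "darkred"),
                       values  = pos,
                       limits  = c(0, max_sum)) +
  theme_bw()

print(ggp_cont)
```

Fixing `limits` to `c(0, max_sum)` keeps the break at 180 regardless of which values happen to be present in the data; without it, the positions would be relative to the observed range.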