I have a dataset like the following:
import numpy as np
from pandas import DataFrame
mypos = np.random.randint(10, size=(100, 2))
mydata = DataFrame(mypos, columns=['x', 'y'])
myres = np.random.rand(100, 1)
mydata['res'] = myres
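The res variable is continuous. The x and y variables are integers representing positions (so they are largely repeated), and res represents the correlation between a pair of positions.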
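To make the repetition concrete, here is a minimal sketch that counts how often each (x, y) pair occurs and keeps the largest res per pair. The fixed seed and the names `pair_counts`, `n_repeated_rows` and `max_res_per_pair` are assumptions added for illustration; they are not part of the dataset above.

```python
import numpy as np
import pandas as pd

# Same shape of data as above; the seed is an assumption for reproducibility
rng = np.random.default_rng(0)
mydata = pd.DataFrame(rng.integers(10, size=(100, 2)), columns=["x", "y"])
mydata["res"] = rng.random(100)

# Number of rows sharing each (x, y) pair
pair_counts = mydata.groupby(["x", "y"]).size()

# Rows whose (x, y) pair already appeared earlier in the frame
n_repeated_rows = int(mydata.duplicated(["x", "y"]).sum())

# Largest res observed for each pair (one possible way to collapse duplicates)
max_res_per_pair = mydata.groupby(["x", "y"])["res"].max()

print(pair_counts.sort_values(ascending=False).head())
print("rows repeating an earlier pair:", n_repeated_rows)
```

Taking the maximum is only one choice; a mean or a count may suit other data better.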
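What is the best way to visualize this dataset? Approaches I have already considered: a scatter plot with res shown as colour, and a parallel coordinates plot.
The first approach becomes problematic when the number of positions grows large, because the high values of res (the ones we care about) get drowned in a sea of small points.
The second approach looks promising, but I am having trouble producing it. I tried the parallel_coordinates function in the pandas module, but it does not behave the way I would like. (See this question: parallel coordinates plot for continous data in pandas)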
Answer 0 (score: 1)
I hope this helps you find a solution, although it is in R. Good luck.
# optional: RColorBrewer provides ready-made palettes;
# colorRampPalette() used below comes from grDevices, which R attaches by default
library(RColorBrewer)
# create the random data
dd <- data.frame(
  x = round(runif(100, 0, 10), 0),
  y = round(runif(100, 0, 10), 0),
  res = runif(100)
)
# pick the number of colours (granularity of colour scale)
nColors <- 100
# create the colour palette
cols <- colorRampPalette(colors = c("white", "blue"))(nColors)
# get a zScale for the colours
zScale <- seq(min(dd$res), max(dd$res), length.out = nColors)
# function that returns the nearest colour given a value of res
# (which.min always returns a single index, even when two colours are equally close)
findNearestColour <- function(x) {
  colorIndex <- which.min(abs(zScale - x))
  return(cols[colorIndex])
}
# the first plot is the scatterplot
### this has problems because points come out on top of each other
plot(y ~ x, dd, type = "n")
for (i in seq_len(nrow(dd))) {
  with(dd[i, ],
    points(y ~ x, col = findNearestColour(res), pch = 19)
  )
}
# this is your parallel coordinates plot (a little better)
# each row becomes one segment from (0, x) to (1, y), coloured by res
plot(1, 1, xlim = c(0, 1), ylim = c(min(dd$x, dd$y), max(dd$x, dd$y)),
     type = "n", axes = FALSE, ylab = "", xlab = "")
for (i in seq_len(nrow(dd))) {
  with(dd[i, ],
    segments(0, x, 1, y, col = findNearestColour(res))
  )
}
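Since the question uses Python, the same two plots can be roughly sketched with matplotlib. This is a sketch under assumptions, not a port endorsed by the answer above: the fixed seed, the "Blues" colormap, the Agg backend and the choice to draw rows in ascending res order (so that high values land on top) are all additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import LineCollection
from matplotlib.colors import Normalize

# Same shape of data as in the question; the seed is an assumption
rng = np.random.default_rng(0)
mydata = pd.DataFrame(rng.integers(10, size=(100, 2)), columns=["x", "y"])
mydata["res"] = rng.random(100)

# Draw low res first so that high res values are painted last, on top
ordered = mydata.sort_values("res")
norm = Normalize(ordered["res"].min(), ordered["res"].max())

fig, (ax_scatter, ax_parallel) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot coloured by res on a continuous scale
ax_scatter.scatter(ordered["x"], ordered["y"], c=ordered["res"],
                   cmap="Blues", norm=norm)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

# Parallel coordinates: one segment per row, from (0, x) to (1, y)
n = len(ordered)
starts = np.column_stack([np.zeros(n), ordered["x"].to_numpy()])
ends = np.column_stack([np.ones(n), ordered["y"].to_numpy()])
segments = np.stack([starts, ends], axis=1)  # shape (n, 2, 2)

lines = LineCollection(segments, cmap="Blues", norm=norm)
lines.set_array(ordered["res"].to_numpy())  # colour each segment by its res
ax_parallel.add_collection(lines)
ax_parallel.set_xlim(0, 1)
ax_parallel.set_ylim(-0.5, 9.5)
ax_parallel.set_xticks([0, 1])
ax_parallel.set_xticklabels(["x", "y"])
fig.colorbar(lines, ax=ax_parallel, label="res")
```

Call `fig.savefig(...)` or switch to an interactive backend and use `plt.show()` to view the result. Sorting by res does not remove the overplotting; it only decides which values stay visible.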