Question

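I have a synthetic dataset of 1000 noisy polynomials of various orders and sin/cos curves, which I can plot as lines using Python seaborn.

Since I have many overlapping lines, I would like to plot some kind of heatmap or histogram of the line plot. I have tried iterating over the columns and aggregating the counts so that I can use seaborn's heatmap, but with this many lines it takes a long time.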

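The closest I have come to what I want is a hexbin plot (with seaborn jointplot).

But it is a compromise between runtime and granularity (the plot shown uses gridsize 750). I could not find any other plot type for my problem, but I also do not know exactly what it would be called.

I also tried setting the line alpha to 0.2. This gives a plot similar to what I want, but it is less precise (once more than 5 lines overlap at the same point, there is no transparency left). It also lacks the typical colours of a heatmap.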

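(Search terms I tried without success: heatmap, 2D line histogram, line histogram, density plot...)

Does anyone know of a package that plots this more efficiently and at higher quality, or know how to do it with the popular Python plotting libraries (i.e. the matplotlib family: matplotlib, seaborn, bokeh)? I am fine with any package.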

Answer 1

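It took me a while, but I finally solved this with Datashader. If you use a notebook, the plots can be embedded in interactive Bokeh plots, which looks really nice.

Anyhow, here is the code for static images, in case someone else needs something similar: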

# coding: utf-8
import time

import numpy as np
from numpy.polynomial import polynomial
import pandas as pd

import matplotlib.pyplot as plt
import datashader as ds
import datashader.transfer_functions as tf


# Named "seaborn-whitegrid" before matplotlib 3.6; the old name was removed in 3.8
plt.style.use("seaborn-v0_8-whitegrid")

def create_data():
    # ...

# Each column is one data sample
df = create_data()

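# The following appends a NaN row and reshapes the dataframe into two columns, with the samples stacked on top of each other.
# The NaN row separates consecutive lines, so all samples can be aggregated in a single call.
#   THIS IS CRUCIAL TO OPTIMIZE SPEED: https://github.com/bokeh/datashader/issues/286

# Append row with nan-values (DataFrame.append was removed in pandas 2.0, so use pd.concat)
nan_row = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns, index=[np.nan])
df = pd.concat([df, nan_row])

# Reshape (DataFrame.as_matrix was removed in pandas 1.0, so use to_numpy)
x, y = df.shape
arr = df.to_numpy().reshape((x * y, 1), order='F')
df_reshaped = pd.DataFrame(arr, columns=list('y'), index=np.tile(df.index.values, y))
df_reshaped = df_reshaped.reset_index()
# Rename the former index column to 'x'
df_reshaped = df_reshaped.rename(columns={df_reshaped.columns[0]: 'x'})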

# Plotting parameters
# The index now contains a NaN entry, so use the NaN-aware min/max
x_range = (np.nanmin(df.index.values), np.nanmax(df.index.values))
y_range = (df.min().min(), df.max().max())
w = 1000
h = 750
dpi = 150
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)

# Aggregate data
t0 = time.time()
aggs = cvs.line(df_reshaped, 'x', 'y', ds.count())
print("Time to aggregate line data: {}".format(time.time()-t0))

# One colored plot
t1 = time.time()
stacked_img = tf.Image(tf.shade(aggs, cmap=["darkblue", "darkblue"]))
print("Time to create stacked image: {}".format(time.time() - t1))

# Save
f0 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax0 = f0.add_subplot(111)
ax0.imshow(stacked_img.to_pil())
ax0.grid(False)
f0.savefig("stacked.png", bbox_inches="tight", dpi=dpi)

# Heat map - this uses histogram equalization (the built-in default); there are other options, though.
t2 = time.time()
heatmap_img = tf.Image(tf.shade(aggs, cmap=plt.cm.Spectral_r))
print("Time to create stacked image: {}".format(time.time() - t2))

# Save
f1 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax1 = f1.add_subplot(111)
ax1.imshow(heatmap_img.to_pil())
ax1.grid(False)
f1.savefig("heatmap.png", bbox_inches="tight", dpi=dpi)

This gives the following run times (in seconds):

Time to aggregate line data: 0.7710442543029785
Time to create stacked image: 0.06000351905822754
Time to create stacked image: 0.05600309371948242

The resulting plots are the two images saved above, stacked.png and heatmap.png.
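As a side note on the reshape step in the code above, here is a minimal sketch of what the NaN-separated layout looks like on a tiny table. The helper name `to_nan_separated` and the toy data are made up for illustration; only pandas and NumPy are used, so it can be run without Datashader.

```python
import numpy as np
import pandas as pd


def to_nan_separated(wide):
    """Stack the columns of `wide` into one long x/y table.

    Hypothetical helper: a NaN row is placed after each column so that a
    line renderer can tell where one line ends and the next begins.
    """
    n_rows, n_cols = wide.shape
    # One extra NaN slot per column marks the end of a line.
    x = np.append(wide.index.to_numpy(dtype=float), np.nan)
    y = np.vstack([wide.to_numpy(dtype=float), np.full((1, n_cols), np.nan)])
    return pd.DataFrame({
        "x": np.tile(x, n_cols),        # x values repeated once per line
        "y": y.reshape(-1, order="F"),  # column-major: line 0, NaN, line 1, NaN
    })


# Toy input: two lines ("a" and "b") sampled at three x positions.
wide = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=[0.0, 0.5, 1.0],
)
long_df = to_nan_separated(wide)
print(long_df)
```

The printed table has eight rows: the three points of "a", a NaN row, the three points of "b", and a final NaN row. A table of this shape is what gets passed to `cvs.line(...)` in the answer.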

Answer 2

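Although it looks like you have already tried this, plotting the counts seems to give a good representation of the data. However, it really depends on what you are trying to find in your data: what is it supposed to tell you?

The long runtime comes from drawing so many lines; a count-based heatmap, by contrast, plots quite quickly.

I created some dummy sine-wave data, parameterized by noise, number of lines, amplitude and shift, and added a boxplot and a heatmap.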

import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import random
import pandas as pd

np.random.seed(0)
random.seed(0)  # random.choice below uses the stdlib RNG, so seed it as well

#create dummy data
N = 200
sinuses = []
no_lines = 200
for i in range(no_lines):
    a = np.random.randint(5, 40)/5 #amplitude
    x = random.choice([int(N/5),  int(N/(2/5))]) #random shift
    sinuses.append(np.roll(a * np.sin(np.linspace(0, 2 * np.pi, N))  + np.random.randn(N), x))

fig = plt.figure(figsize=(20 / 2.54, 20 / 2.54))
sins = pd.DataFrame(sinuses, )

ax1 = plt.subplot2grid((3,10), (0,0), colspan=10)
ax2 = plt.subplot2grid((3,10), (1,0), colspan=10)
ax3 = plt.subplot2grid((3,10), (2,0), colspan=9)
ax4 = plt.subplot2grid((3,10), (2,9))

# plot line data
sins.T.plot(ax=ax1, color='lightblue',linewidth=.3)
ax1.legend_.remove()
ax1.set_xlim(0, N)

# try boxplot
sins.plot.box(ax=ax2, showfliers=False)
xticks = ax2.xaxis.get_major_ticks()
for index, label in enumerate(ax2.get_xaxis().get_ticklabels()):
    xticks[index].set_visible(False)  # hide ticks where labels are hidden

#make a list of bins
no_bins = 20
bins = list(np.arange(sins.min().min(), sins.max().max(), int(abs(sins.min().min())+sins.max().max())/no_bins))
bins.append(sins.max().max())

# calculate histogram
hists = []
for col in sins.columns:
    count, division = np.histogram(sins.iloc[:,col], bins=bins)
    hists.append(count)
hists = pd.DataFrame(hists, columns=[str(i) for i in bins[1:]])
print(hists.shape, '\n', hists.head())

cmap = mpl.colors.ListedColormap(['white', '#FFFFBB', '#C3FDB8', '#B5EAAA', '#64E986', '#54C571',
          '#4AA02C', '#347C17', '#347235', '#25383C', '#254117'])

#heatmap
im = ax3.pcolor(hists.T, cmap=cmap)
cbar = plt.colorbar(im, cax=ax4)

yticks = np.arange(0, len(bins))
yticklabels = hists.columns.tolist()
ax3.set_yticks(yticks)
ax3.set_yticklabels([round(i,1) for i in bins])
ax3.set_title('Count')
yticks = ax3.yaxis.get_major_ticks()

for index, label in enumerate(ax3.get_yaxis().get_ticklabels()):
    if index % 3 != 0: #make some labels invisible
        yticks[index].set_visible(False)  # hide ticks where labels are hidden

plt.show()
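The per-column histogram loop above can also be written as a single vectorized call. The following is only a sketch under assumptions: the dummy data is regenerated with NumPy's `default_rng` rather than taken from the code above, the variable names are illustrative, and the y bins are left to `np.histogram2d` instead of the hand-built bin list.

```python
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_points, n_bins = 200, 200, 20

# Dummy data in the spirit of the answer: noisy sine waves with random amplitude.
t = np.linspace(0, 2 * np.pi, n_points)
amplitudes = rng.integers(5, 40, size=(n_lines, 1)) / 5
lines = amplitudes * np.sin(t) + rng.standard_normal((n_lines, n_points))

# Pair every y value with its sample position, then bin both axes at once.
positions = np.broadcast_to(np.arange(n_points), lines.shape)
counts, x_edges, y_edges = np.histogram2d(
    positions.ravel(),
    lines.ravel(),
    # One x bin per sample position, n_bins equal-width bins along y.
    bins=[np.arange(n_points + 1) - 0.5, n_bins],
)

print(counts.shape)  # (n_points, n_bins)
```

Each row of `counts` holds the histogram for one sample position, so every row sums to the number of lines. To draw it, something like `ax.pcolormesh(x_edges, y_edges, counts.T)` should work, since `pcolormesh` expects the count array with y along the first axis.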

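Although the boxplot is easy to understand, it does not show the actual distribution of the data very well; still, knowing where the median and the quantiles are can be helpful.

Increasing the number of lines and the number of values per line greatly increases the plotting time for the line plot, while generating the heatmap stays fairly fast. The boxplot, however, becomes illegible.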
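If the median and quantiles are what matter and the boxplot gets too crowded, one possible alternative is to compute the quantiles per sample position and draw them as bands. This is a sketch, not part of the original answer; the data, the quantile levels and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_points = 1000, 500

# Illustrative data: many noisy copies of one sine wave.
t = np.linspace(0, 2 * np.pi, n_points)
lines = 3 * np.sin(t) + rng.standard_normal((n_lines, n_points))

# One row per quantile level, one column per sample position.
levels = [0.05, 0.25, 0.5, 0.75, 0.95]
bands = np.quantile(lines, levels, axis=0)

print(bands.shape)  # (5, 500)
```

With matplotlib, `ax.fill_between(t, bands[0], bands[4], alpha=0.2)` and `ax.fill_between(t, bands[1], bands[3], alpha=0.4)` would shade the outer and inner ranges, and `ax.plot(t, bands[2])` would draw the median. The cost of drawing is then independent of the number of lines.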

I could not fully replicate your data (or know its actual size), but perhaps the heatmap helps.

Line-based heatmap or 2D line histogram
