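I want to plot a fairly small IoT CSV dataset, about 2 GB. It has the dimensions (~20,000, ~18,000). Each column should become a subplot with its own y-axis. I use the following code to generate the picture: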
import random

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 2000 timestamps, 2 minutes apart
times = pd.date_range('2012-10-01', periods=2000, freq='2min')
timeseries_array = np.array(times)
# 2000 shuffled column names, each column holding 2000 random values
cols = random.sample(range(1, 2001), 2000)
values = []
for col in cols:
    values.append(random.sample(range(1, 2001), 2000))
time = pd.DataFrame(data=timeseries_array, columns=['date'])
graph = pd.DataFrame(data=values, columns=cols, index=timeseries_array)

name = 'default.png'  # output file name (placeholder)

# one subplot per column, all sharing the same x and y axes
fig, axarr = plt.subplots(len(graph.columns), sharex=True, sharey=True,
                          constrained_layout=True, figsize=(50, 50))
fig.autofmt_xdate()
for i, ax in enumerate(axarr):
    ax.plot(time['date'], graph[graph.columns[i]].values)
    ax.set(ylabel=graph.columns[i])
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    myFmt = mdates.DateFormatter('%d.%m.%Y %H:%M')
    ax.xaxis.set_major_formatter(myFmt)
    ax.label_outer()
print('--save-fig--')
plt.savefig(name, dpi=500)
plt.close()
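But this is very slow: 100 subplots take about 1 minute, and 2000 take about 20 minutes. My machine has 10 cores and 35 GB of RAM. Do you have any tips for speeding this up? Is multithreading possible? As far as I can see, this uses only one core. Are there tricks for drawing only the relevant parts? Or is there another way to draw this faster, with everything in one plot and no subplots?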
Answer 0 (score: 1)
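Thanks to @Asmus, I came up with this solution, which took me from 20 minutes down to 40 seconds for a (2000, 2000) frame. As a beginner I did not find any well-documented solution, so I am posting mine here, for time series with a large number of columns: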
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np


def print_image_fast(name="default.png", graph=None):
    # graph: pandas DataFrame with a datetime index and one column per series
    int_columns = len(graph.columns)
    # enlarge the figure by 30 inches per 1000 columns; works well with
    # 500 dpi, label size 2 and line width 0.1
    y_size = (int_columns / 1000) * 30
    fig = plt.figure(figsize=(10, y_size))
    ax = fig.add_subplot(1, 1, 1)
    # set the time formatter for the time series
    myFmt = mdates.DateFormatter('%d.%m.%Y %H:%M')
    ax.xaxis.set_major_formatter(myFmt)
    # store the label offsets
    y_label_offsets = []
    current = 0
    for i, col in enumerate(graph.columns):
        # max height of the previous column
        last = current
        # max value of the current column, i.e. its height on the y axis
        current = np.amax(graph[col].values)
        if i == 0:
            # the first column is drawn without an offset
            y_offset = 0
        else:
            # accumulated offset so far + previous height + 1 is the
            # new zero point for this column
            y_offset = y_offset + last + 1
        # the label sits at the current offset plus half the column height
        y_offset_label = y_offset + (current / 2)
        y_label_offsets.append(y_offset_label)
        # plot the column shifted by its offset
        ax.plot(graph.index.values, graph[col].values + y_offset,
                'r-o', ms=0.1, mew=0, mfc='r', linewidth=0.1)
    # y limits: last offset plus the full height of the last column
    ax.set_ylim([0, y_offset + current])
    # x limits: first and last timestamp
    ax.set_xlim([graph.index.values[0], graph.index.values[-1]])
    # label the y axis with the column names at the computed positions
    plt.yticks(y_label_offsets, graph.columns, fontsize=2)
    # time labels on the x axis
    plt.xticks(fontsize=15, rotation=90)
    plt.savefig(name, dpi=500)
    plt.close()
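The stacking logic in the loop above can also be checked on its own, without plotting. This is only a minimal sketch: `stack_offsets` is a hypothetical helper name that is not part of the code above, and it assumes every column contains non-negative values, so that a column's maximum is also its height.

```python
import numpy as np
import pandas as pd


def stack_offsets(graph):
    """Return (y_offsets, label_positions) for stacking columns in one axes.

    Hypothetical helper mirroring the loop in print_image_fast: each column
    starts 1 unit above the maximum of the previous column.
    Assumes all values are non-negative.
    """
    col_max = graph.max(axis=0).to_numpy(dtype=float)
    # Offset of column i = sum over all earlier columns of (max + 1).
    y_offsets = np.concatenate(([0.0], np.cumsum(col_max[:-1] + 1)))
    # The tick label sits halfway up each column's band.
    label_positions = y_offsets + col_max / 2
    return y_offsets, label_positions


# Tiny made-up frame with column maxima 3, 5 and 2.
demo = pd.DataFrame({"a": [1, 3, 2], "b": [5, 0, 4], "c": [2, 1, 0]})
offsets, labels = stack_offsets(demo)
print(offsets)
print(labels)
```

For the demo frame the offsets come out as 0, 4 and 10, the same values the loop produces step by step.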
//Edit: For anyone interested, a (20k, 20k) DataFrame takes up about 20 GB of my RAM. I also had to switch savefig to SVG output, because Agg cannot handle sizes larger than 2^16 pixels.
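The size problem follows directly from the figure-height rule in the function above (30 inches per 1000 columns at 500 dpi). Here is a rough check, taking the 2^16 pixel limit as stated in the edit; the helper names `figure_height_pixels` and `fits_raster_limit` are hypothetical and only illustrate the arithmetic.

```python
RASTER_MAX_PIXELS = 2 ** 16  # per-dimension limit, as stated in the edit


def figure_height_pixels(n_columns, dpi=500, inches_per_1000_columns=30):
    """Height in pixels of a figure sized by the rule in print_image_fast."""
    height_inches = (n_columns / 1000) * inches_per_1000_columns
    return height_inches * dpi


def fits_raster_limit(n_columns, dpi=500):
    """True if the figure height stays below the stated pixel limit."""
    return figure_height_pixels(n_columns, dpi) < RASTER_MAX_PIXELS


print(figure_height_pixels(2000))   # 60 inches * 500 dpi
print(fits_raster_limit(2000))
print(fits_raster_limit(20000))     # 600 inches * 500 dpi
```

With 20,000 columns the height comes to 300,000 pixels at 500 dpi, well past the limit. Under the same rule, a lower dpi (for example 100) would bring it back under 2^16, and a vector format such as SVG sidesteps the limit.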