Question

I want to run a Markov Chain Monte Carlo simulation, so I need to store the generated states. The problem is that I want to run my program for a while and generate a lot of states, but MATLAB gives me an 'OUT OF MEMORY' error. Since I don't need the full history of states in memory the whole time (I only need the preceding state to generate the next one), I thought I could write my generated states to disk every 10000 iterations and keep only the last one in memory. At the end I want to do some calculations such as mean and variance, plot a histogram of the generated data, and eventually plot the data with plot-big: http://de.mathworks.com/matlabcentral/fileexchange/40790-plot-big

I create a struct that stores the number of dimensions, a vector of coordinates of that length, and another vector of the same size. After that initialization I use a for loop to generate the rest of the Markov chain:

state(1) = struct('dim', 3 ,'coords',rand(3,1), 'vals', rand(3,1));
state(10000) = struct('dim', [], 'coords', [], 'vals', []);
for i = 2:10000
    state(i) = generateNewState(state(i-1));
end

How can I store the generated states on disk, continue with the next 10000 states, append them to the existing .mat file, and keep going until I have generated, say, 1e10 states? And how can I then use the stored data for calculations?

Answer 1

The first ideas that came to mind: 1. Maybe you can compute all the quantities you need at each step? For example:

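n = 10000;
stateOld = rand(3,1);                   % initial state (example value)
runningMean = zeros(size(stateOld));    % do not name this "mean": it would shadow the built-in
for i = 1:n
    state = generateNewState(stateOld);
    runningMean = runningMean + state/n;    % accumulate each state's share of the mean
    stateOld = state;                       % only the previous state is kept
end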

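This lets you compute the mean without saving all the state values. You can do the same for the histogram: if you know the axis values (bin edges) of the histogram in advance, you can create an array of counters and, at each step, count which interval the state falls into: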

if (state < 0.1 )
inter1 = inter1 + 1;
elseif (state < 0.2)
inter2 = inter2 + 1;
...
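
As a rough sketch, the running mean and the interval counters can share one loop, so nothing but the current state is kept in memory. This assumes scalar double states and bin edges known in advance. The anonymous `generateNewState` below is only a placeholder for the real update, and the variance update (Welford's method) is an addition that the answer itself does not mention.

```matlab
% Placeholder update (assumption): stands in for the real generateNewState.
generateNewState = @(s) 0.9*s + 0.1*rand();

n      = 10000;
edges  = 0:0.1:1;                      % bin edges, assumed known in advance
counts = zeros(1, numel(edges) - 1);   % one counter per interval
mu     = 0;                            % running mean
M2     = 0;                            % running sum of squared deviations
state  = rand();                       % initial state (example value)

for i = 1:n
    state = generateNewState(state);

    % Welford update of mean and squared deviations
    delta = state - mu;
    mu    = mu + delta / i;
    M2    = M2 + delta * (state - mu);

    % Increment the counter of the interval that contains the state
    k = find(state >= edges(1:end-1) & state < edges(2:end), 1);
    if ~isempty(k)
        counts(k) = counts(k) + 1;
    end
end

variance = M2 / (n - 1);               % sample variance
```

Replacing the chain of `if`/`elseif` branches with a vector of counters keeps the code the same length no matter how many intervals there are.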

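It is a lot of code, but it works, and you can continue in the same way for other quantities.

2. The second way: you can compute the first 10000 steps, save them to Excel (the xlswrite function), and then overwrite the state values in a loop as i increases. In this case you save all the values and load them back in the same way. As you can see, I used a double for the state instead of a struct, but it is easy to implement this with a struct.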
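
A minimal sketch of the chunked approach, using a raw binary file with `fwrite`/`fread` instead of `xlswrite` (a swap on my part, since appending to a binary file is simple and has no row limit). The file name, the chunk count, and the placeholder `generateNewState` are all assumptions for illustration.

```matlab
% Placeholder update (assumption): stands in for the real generateNewState.
generateNewState = @(s) 0.9*s + 0.1*rand(size(s));

fname     = 'states.bin';      % hypothetical file name
dim       = 3;
chunkSize = 10000;
nChunks   = 5;
last      = rand(dim, 1);      % only the previous state is needed to continue

fid = fopen(fname, 'w');       % start from an empty file
for c = 1:nChunks
    chunk = zeros(dim, chunkSize);          % buffer is reused for every chunk
    for i = 1:chunkSize
        last = generateNewState(last);
        chunk(:, i) = last;
    end
    fwrite(fid, chunk, 'double');           % append the chunk, column by column
end
fclose(fid);

% Later: read the file back one chunk at a time
fid   = fopen(fname, 'r');
total = 0;
while true
    chunk = fread(fid, [dim, chunkSize], 'double');
    if isempty(chunk), break; end
    total = total + size(chunk, 2);         % process the chunk here
end
fclose(fid);
```

Memory use stays at one chunk regardless of how long the chain runs; only the disk has to hold everything.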

Answer 2

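Well, this was a while ago, but maybe it can still help someone. I solved my problem by not using .mat files and writing the structs to a csv file instead. I built a format string that produces a table layout and used it to write the struct elements to my csv file with fprintf(). Before that, I had to convert my struct array into a cell array so that I could access the struct contents easily. For this I used struct2table() followed by table2cell(), which gave me a cell array temp, and then I called fprintf(fid, FormatString, temp{:}). Done correctly, this produces a header line followed by one Markov state per row, with a separate column for each dimension of the state.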
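
A simplified sketch of the format-string idea. It assumes the coordinates are already in a numeric matrix with one state per column, so it skips the struct2table()/table2cell() step. The file name and the column names `x1`, `x2`, ... are made up for illustration.

```matlab
dim    = 3;
nState = 5;
coords = rand(dim, nState);    % example data: one state per column

% Header "x1,x2,x3" and one numeric field per dimension
names  = arrayfun(@(k) sprintf('x%d', k), 1:dim, 'UniformOutput', false);
header = strjoin(names, ',');
fmt    = [repmat('%.10g,', 1, dim - 1), '%.10g\n'];

fid = fopen('states.csv', 'w');    % use 'a' instead of 'w' to append later chunks
fprintf(fid, '%s\n', header);
fprintf(fid, fmt, coords);         % fprintf walks coords column by column: one state per line
fclose(fid);
```

Because fprintf recycles the format string over the data in column-major order, the matrix has to be laid out with states as columns. With states as rows it would need a transpose first.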

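After my simulation I used MATLAB's MapReduce to process the csv file. Basically I ran map and reduce functions to compute the mean, variance, count, minimum and maximum. Then I used the minimum and maximum to define the bin edges for my histogram and went through the data a second time, computing the bin counts with a corresponding MapReduce algorithm. Since these algorithms are designed to work in parallel, iterating over my data twice was not a problem, even though it consisted of about 50 million Markov chain states, each of which can have several hundred dimensions. It is more of a storage problem, because this data in csv form can reach 10 GB or more, so be careful when using FAT or FAT32 formatted drives (FAT32 limits a single file to 4 GB). There are smarter ways to store the data, but I chose csv files because they can be processed with MATLAB's MapReduce.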
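
A sketch of what the first pass might look like with `mapreduce` and a `tabularTextDatastore`. This is not the code I actually ran. The file name, the column name `x1`, and the helper names `statsMapper`/`statsReducer` are hypothetical, and it assumes R2016b or later so that local functions can sit at the end of a script (otherwise put them in separate files).

```matlab
% Example data standing in for the simulation output (assumption)
x1 = rand(1000, 1);
x2 = rand(1000, 1);
writetable(table(x1, x2), 'states_demo.csv');

% Only the column of interest is read, block by block
ds     = tabularTextDatastore('states_demo.csv', 'SelectedVariableNames', {'x1'});
result = mapreduce(ds, @statsMapper, @statsReducer);
out    = readall(result);          % table with Key and Value columns
s      = out.Value{1};             % [count, sum, sum of squares, min, max]

mu     = s(2) / s(1);
sigma2 = (s(3) - s(1) * mu^2) / (s(1) - 1);   % simple formula; can lose precision on large data

function statsMapper(data, ~, intermKVStore)
    % Summarise one block of rows as a single partial result
    x = data.x1;
    add(intermKVStore, 'stats', [numel(x), sum(x), sum(x.^2), min(x), max(x)]);
end

function statsReducer(~, intermValIter, outKVStore)
    % Merge the partial results of all blocks
    acc = [0, 0, 0, Inf, -Inf];
    while hasnext(intermValIter)
        v   = getnext(intermValIter);
        acc = [acc(1:3) + v(1:3), min(acc(4), v(4)), max(acc(5), v(5))];
    end
    add(outKVStore, 'stats', acc);
end
```

The second pass for the histogram would follow the same pattern: the mapper counts each block against edges built from `s(4)` and `s(5)`, and the reducer adds the count vectors.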

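In the end I could not use plot-big from the File Exchange, but it was not hard to extract a single column of the data, which usually still fits in memory. With that one column, which describes the corresponding dimension of every Markov state, you can plot the evolution of that dimension by plotting only every 1000th element or so. Save the figure and close it to reduce the load on the PC, then repeat this for each dimension to get an overview of the simulated data. I hope it helps.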
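
Roughly, the thinning step for one dimension could look like the sketch below. The random-walk data, the step of 1000, and the output file name are placeholders. In practice `x` would be the column extracted from the csv file.

```matlab
x    = cumsum(randn(1e6, 1));      % example data standing in for one column
step = 1000;                       % keep every 1000th element
idx  = 1:step:numel(x);

fig = figure('Visible', 'off');    % no window is needed when only saving
plot(idx, x(idx));
xlabel('iteration');
ylabel('dimension 1');
saveas(fig, 'dim1.png');           % hypothetical output file name
close(fig);                        % free the figure before the next dimension
```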

Store and process big data in matlab
