I have a tricky situation here. I'm trying to avoid out-of-memory exceptions while writing a large CSV dataset to an H5 file through the HDFDotNet API. However, when I make a second pass through the loop with a chunk of file data the same size as the first iteration, I get an out-of-memory exception, even though the first pass works, the second doesn't, and the amount of memory in use should be well below the ~1.2 GB cap. I've already settled on the size of the chunks I want to read at a time, and on the size of the chunks I need to write at a time due to a limitation of the API. The CSV file is about 105k rows long and 500 columns wide.
I end up hitting the out-of-memory exception on the second pass through the read section:

private void WriteDataToH5(H5Writer h5WriterUtil)
{
    int startRow = 0;
    int skipHeaders = csv.HasColumnHeaders ? 1 : 0;
    int readIntervals = (-8 * csv.NumColumns) + 55000;
    int numTaken = readIntervals;

    while (numTaken == readIntervals)
    {
        int timeStampCol = HasTimestamps ? 1 : 0;
        var readLines = File.ReadLines(this.Filepath)
            .Skip(startRow + skipHeaders).Take(readIntervals)
            .Select(s => s.Split(new char[] { ',' }).Skip(timeStampCol)
                .Select(x => Convert.ToSingle(x)).ToList()).ToList();

        // 175k is the max number of cells that can be written at one time
        // (unconfirmed via the API; tested, and it is definitely less than 200k, while 175k works)
        int writeIntervals = Convert.ToInt32(175000 / csv.NumColumns);

        for (int i = 0; i < readIntervals; i += writeIntervals)
        {
            long[] startAt = new long[] { startRow, 0 };
            h5WriterUtil.WriteTwoDSingleChunk(readLines.Skip(i).Take(writeIntervals).ToList(),
                DatasetsByNamePair[Tuple.Create(groupName, dataset)], startAt);
            startRow += writeIntervals;
        }

        numTaken = readLines.Count;
        GC.Collect();
    }
}
In this case, my readIntervals var comes out to 50992 and writeIntervals comes out to around 350. Thanks!
Answer 0 (score: 2)
You're making a lot of unnecessary allocations:
var readLines = File.ReadLines(this.Filepath)
    .Skip(rowStartAt).Take(numToTake)
    .Select(s => s.Split(new char[] { ',' }) // why do you need to split here?
        .Skip(timeStampCol)
        .Select(x => Convert.ToSingle(x)).ToList()).ToList(); // why call ToList() twice?
File.ReadLines returns an enumerator, so simply walk it line by line: split each line, skip the columns you don't need, and convert and keep only the values you do need.
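As a rough illustration of that advice, here is a minimal sketch of a streaming version. It is an assumption built on the names used in the question (WriteTwoDSingleChunk, DatasetsByNamePair, groupName, dataset, csv, HasTimestamps), not the answerer's exact code:

// A minimal sketch, assuming the question's H5Writer API and class fields.
// The file is enumerated exactly once; only one write chunk of rows is
// ever buffered in memory, and the buffer list is reused between writes.
private void WriteDataToH5Streaming(H5Writer h5WriterUtil)
{
    int skipHeaders = csv.HasColumnHeaders ? 1 : 0;
    int timeStampCol = HasTimestamps ? 1 : 0;
    int writeIntervals = 175000 / csv.NumColumns; // same empirical API limit as above

    var buffer = new List<List<float>>(writeIntervals);
    long startRow = 0;

    foreach (string line in File.ReadLines(this.Filepath).Skip(skipHeaders))
    {
        buffer.Add(line.Split(',')
            .Skip(timeStampCol)
            .Select(Convert.ToSingle)
            .ToList());

        if (buffer.Count == writeIntervals)
        {
            h5WriterUtil.WriteTwoDSingleChunk(buffer,
                DatasetsByNamePair[Tuple.Create(groupName, dataset)],
                new long[] { startRow, 0 });
            startRow += buffer.Count;
            buffer.Clear(); // reuse the list instead of reallocating it
        }
    }

    if (buffer.Count > 0) // flush the final partial chunk
    {
        h5WriterUtil.WriteTwoDSingleChunk(buffer,
            DatasetsByNamePair[Tuple.Create(groupName, dataset)],
            new long[] { startRow, 0 });
    }
}

Note that this also removes the repeated re-scan in the original code: File.ReadLines(...).Skip(startRow + skipHeaders) re-reads every previously processed line from the start of the file on each pass of the outer loop.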
If you're still getting out-of-memory exceptions while using less than 1.2 GB of memory, consider the following: