Question

I have data in the following tabular form in a text file:

Userid Gameid Count
Jason  1      2
Jason  2      10
Jason  4      20
Mark   1      2
Mark   2      10
................
.................

There are 81 Gameids in total, and I have about 2 million distinct users.

What I want is to read this text file and create a sparse matrix of the form

      Column   1   2   3   4   5   6  .
Row1  Jason    2  10      20
Row2  Mark     2  10

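Right now I can load this text file in MATLAB, read the users one by one, read their counts, and initialize the sparse array. I tried this, and initializing the row for one user takes 1 second. For 2 million users in total, that will take far too long (roughly 23 days at that rate).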

What is the most efficient way to do this?

Here is my code:

data = sparse(10000000, num_games);
loc = 1;

for f=1:length(files)
  file = files(f).name;

  fid = fopen(file,'r');

  s = textscan(fid,'%s%d%d');

  position = s{2};   % Gameid column: used as the column index
  count = s{3};      % Count column: used as the stored value

  A=s{:,1};
  A=cellstr(A);

  Users = unique(A);   % MATLAB is case-sensitive; the loop below uses 'Users'

  for aa = 1:length(Users)
      a = strfind(A, char(Users(aa)));
      ix=cellfun('isempty',a);
      index = find(ix==0);
      data(loc,position(index,:)) = count(index,:);
      loc = loc + 1;
  end
end

Answer 1

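Use unique again, this time on the GameID, to avoid the inner loop.
Store the user names, because in the original code you cannot tell which name belongs to which row. The same goes for the game IDs.
Make sure to close each file after opening it.
sparse matrices do not support 'int32'; you need to store the data as double (hence '%f' instead of '%d' in the textscan format).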
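
As a small illustration of the first and last points, here is a sketch on the five sample rows from the question. The variable names (U, G, C, names, ids, S) are made up for this example and are not part of the code below.

```matlab
% Sample rows from the question (Userid, Gameid, Count)
U = {'Jason'; 'Jason'; 'Jason'; 'Mark'; 'Mark'};
G = [1; 2; 4; 1; 2];
C = [2; 10; 20; 2; 10];

% The third output of unique maps every entry to the index of its unique
% value, so it can be used directly as a row (or column) subscript.
[names, ~, r] = unique(U);   % names = {'Jason'; 'Mark'}, r = [1;1;1;2;2]
[ids, ~, c]   = unique(G);   % ids = [1;2;4],             c = [1;2;3;1;2]

% One vectorized call builds the whole matrix; subscripts and values are double.
S = sparse(r, c, C, numel(names), numel(ids));
full(S)   % [2 10 20; 2 10 0]
```

Note that unique compacts the IDs: column 3 of S holds GameID 4, which is why ids has to be kept alongside the matrix. The full multi-file version follows.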

% Accumulators for the row/column subscripts and the per-file results
Rows = [];
Cols = [];
Count = {}; Users = {}; GameIDs = {};

for f = 1:length(files)
    % Read the data into 's'
    fid = fopen(files(f).name,'r');
    s = textscan(fid,'%s%f%f');
    fclose(fid);

    % Spread the data
    [U, G, Count{f}] = s{:};

    [Users{f},~, r] = unique(U); % Unique user names
    [GameIDs{f},~,c] = unique(G); % Unique GameIDs

    Rows = [Rows; r + max([Rows; 0])];
    Cols = [Cols; c + max([Cols; 0])];
end

% Convert to linear vectors
Count = cell2mat(Count');
Users = vertcat(Users{:}); % stack the per-file name lists (their lengths may differ)
GameIDs = cell2mat(GameIDs');

% Create the sparse matrix
Data = sparse(Rows, Cols, Count, length(Users), length(GameIDs), length(Count));

Users will contain the row headers (the user names) and GameIDs the column headers.
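
The loop above offsets the subscripts per file, so a user or GameID that shows up in two files gets a separate row or column for each file. If that is not what you want, one possible variant (a sketch, not part of the original answer) reads everything first and calls unique once over the concatenated columns. The temporary folder, the two file names, and their contents are made up so that the example runs on its own.

```matlab
% Hypothetical setup: two small files in a temporary folder.
% Replace 'd' and 'files' with the real folder and dir() listing.
d = tempname; mkdir(d);
fid = fopen(fullfile(d, 'a.txt'), 'w');
fprintf(fid, 'Jason 1 2\nJason 2 10\nJason 4 20\n');
fclose(fid);
fid = fopen(fullfile(d, 'b.txt'), 'w');
fprintf(fid, 'Mark 1 2\nMark 2 10\nJason 5 7\n');
fclose(fid);
files = dir(fullfile(d, '*.txt'));

% Read every file first, keeping the raw columns per file
U = cell(numel(files), 1); G = U; C = U;
for f = 1:numel(files)
    fid = fopen(fullfile(d, files(f).name), 'r');
    s = textscan(fid, '%s%f%f');
    fclose(fid);
    [U{f}, G{f}, C{f}] = s{:};
end

% Concatenate, then index users and GameIDs once over all files
U = vertcat(U{:}); G = vertcat(G{:}); C = vertcat(C{:});
[Users, ~, r]   = unique(U);
[GameIDs, ~, c] = unique(G);
Data = sparse(r, c, C, numel(Users), numel(GameIDs));

% Look up one entry by user name and GameID
value = full(Data(strcmp(Users, 'Jason'), GameIDs == 5));
```

With this layout Jason occupies a single row even though his records are split across both files, and the lookup on the last line returns the count stored for GameID 5.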

Efficiently loading data in MATLAB
