Say I have the following data S =
Year Week Postcode
2009 24 2035
2009 24 4114
2009 24 4127
2009 26 4114
2009 26 4556
2009 27 7054
2009 27 6061
2009 27 4114
2009 27 2092
2009 27 2315
2009 27 7054
2009 27 4217
2009 27 4551
2009 27 2035
2010 1 4132
2010 1 2155
2010 5 4114
... (>60000 rows)
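In Matlab, I would like to create a matrix with:
Column 1: year (2006-2014)
Column 2: week (1-52 for each year)
The next n columns are the unique postcodes, where the data in each column counts the number of matching occurrences in my data S.
For example: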
year week 2035 4114 4127 4556 7054
2009 24 1 1 1 0 0
2009 25 0 0 0 0 0
2009 26 0 1 0 1 0
2009 27 1 1 0 0 2
2009 28 0 0 0 0 0
Thanks if you can help!
Answer 0 (score: 1)
Here is a working script that produces this tabulation. The output is in the data table. You should:
Code, fully commented for explanation:
% Use rng for repeatability in rand, n = num data entries
rng('default')
n = 100;
% Set up test data. You would use 3 equal length vectors of real data here
years = floor(rand(n,1)*9 + 2006); % random integer between 2006,2014
weeks = floor(rand(n,1)*52 + 1); % random integer between 1, 52
postcodes = floor(rand(n,1)*10)*7 + 4000; % arbitrary integers over 4000
% Create year/week values like 2017.13, get unique indices
[~, idx, ~] = unique(years + weeks/100);
% Set up table with year/week data
data = table();
data.Year = years(idx);
data.Week = weeks(idx);
% Get columns
uniquepostcodes = unique(postcodes);
% Cycle over unique columns, assign data
for ii = 1:numel(uniquepostcodes)
    % Variable names cannot start with a numeric value, make start with 'p'
    postcode = ['p', num2str(uniquepostcodes(ii))];
    % Create data column variable for each unique postcode
    data.(postcode) = zeros(size(data.Year,1),1);
    % Count occurrences of postcode in each date row
    % This uses logical indexing of original data, looking for all rows
    % which satisfy year and week of current row, and postcode of column.
    for jj = 1:numel(data.Year)
        data.(postcode)(jj) = sum(years == data.Year(jj) & ...
                                  weeks == data.Week(jj) & ...
                                  postcodes == uniquepostcodes(ii));
    end
end
% Sort week/year data so all is chronological
data = sortrows(data, [1,2]);
% To check all original data was counted, you could run
% sum(sum(table2array(data(:,3:end))))
% ans = n, means that all data points were counted somewhere
On my PC, n = 60,000 takes under 2.4 seconds. It could almost certainly be optimised, but for something that probably won't be run often, this seems acceptable.
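Processing time grows linearly with the number of unique postcodes, because of the loop structure. So if you double the number of unique postcodes (20 instead of the 10 in my example), the time is close to 4.8 seconds, twice as long.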
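If the loops ever become a bottleneck, one possible alternative is to count all (year/week, postcode) pairs in a single pass with unique and accumarray. This is only a sketch, not the method used in the script above: it assumes years, weeks and postcodes are equal-length numeric column vectors, uses a small hand-picked subset of the question's rows as sample data, and returns a plain numeric matrix (the names yw, rowIdx, colIdx, counts and result are illustrative) rather than a table.

```matlab
% Sketch of a loop-free count using unique + accumarray.
% Sample data: a few rows taken from the question (assumption: your real
% data would be three equal-length numeric column vectors like these).
years     = [2009; 2009; 2009; 2009; 2009; 2009; 2009];
weeks     = [  24;   24;   24;   26;   26;   27;   27];
postcodes = [2035; 4114; 4127; 4114; 4556; 7054; 7054];

% Map every input row to a (year, week) group; yw comes back sorted,
% so the output rows are already chronological
[yw, ~, rowIdx] = unique([years, weeks], 'rows');

% Map every postcode to a column index
[uniquepostcodes, ~, colIdx] = unique(postcodes);

% Add 1 at (row group, postcode column) for each input row
counts = accumarray([rowIdx(:), colIdx(:)], 1, ...
                    [size(yw,1), numel(uniquepostcodes)]);

% Columns 1-2 are year and week, the remaining columns follow the
% order of uniquepostcodes
result = [yw, counts];
disp(uniquepostcodes.');
disp(result);
```

As with the loop version, only year/week combinations that actually occur in the data get a row, and sum(counts(:)) should equal the number of input rows.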
If this solves your problem, please consider accepting this as the answer.