Updating an N-gram 2-D cell array in MATLAB

Time: 2014-09-09 13:33:29

database   file  there
da         fi    th
at         il    he
ta         le    er
ab               re


collection = fileread('e:\m.txt');
collection = regexprep(collection,'<.*?>','');          % strip markup tags
collection = lower(collection);
collection = regexprep(collection,'\W',' ');            % non-word characters -> spaces
collection = strtrim(regexprep(collection,'\s+',' '));  % collapse whitespace runs
words = regexp(collection,' ','split');                 % cell array of words (replaces the eval call)

word = words{1};                                        % first word only
word2 = regexp(word,'\w','match');                      % individual characters
bi = cellfun(@(x,y) [x y], word2(1:end-1)', word2(2:end)','UniformOutput',false); % adjacent-character bigrams




1 answer:

Answer 0 (score: 1)

If you want a cell array as the output, this might work for you:

input_str = 'database file there' % input (no semicolon, so it is displayed)

str1_split = regexp(input_str,'\s','Split'); % split words into cells
NW = numel(str1_split); % number of words
char_arr1 = char(str1_split'); % convert split cells into a space-padded char array
% linear indices of every pair of adjacent columns in the char array
ind1 = bsxfun(@plus,(1:NW*2)',(0:size(char_arr1,2)-2)*NW);
t1 = reshape(char_arr1(ind1),NW,2,[]);
t2 = reshape(permute(t1,[2 1 3]),2,[])'; % char array with one row per pair

out = reshape(mat2cell(t2,ones(1,size(t2,1)),2),NW,[])';
out(reshape(any(t2==' ',2),NW,[])')={''}; % blank out pairs that contain padding
out = [str1_split ; out] % output

Code output:

input_str =
database file there

out = 
    'database'    'file'    'there'
    'da'          'fi'      'th'   
    'at'          'il'      'he'   
    'ta'          'le'      'er'   
    'ab'          ''        're'   
    'ba'          ''        ''     
    'as'          ''        ''     
    'se'          ''        ''
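
For comparison, here is a loop-based sketch of the same idea. It may be easier to follow than the index arithmetic above, and it extends to n-grams other than bigrams. It is not part of the original answer: the parameter `n` and the variable names `words`, `grams`, `cnt` and `maxRows` are assumptions introduced only for this illustration. With the sample input and `n = 2` it should build the same 8-by-3 cell array as shown above.

```matlab
% Illustrative sketch only: per-word character n-grams, padded into one cell array.
input_str = 'database file there';         % same sample input as above
n = 2;                                     % n-gram length (assumed parameter; 2 = bigrams)

words = regexp(input_str,'\s+','split');   % 1-by-NW cell array of words
NW = numel(words);

% Collect the n-grams of each word as a column cell array.
grams = cell(1,NW);
for k = 1:NW
    w = words{k};
    cnt = max(numel(w) - n + 1, 0);        % number of n-grams in this word
    g = cell(cnt,1);
    for p = 1:cnt
        g{p} = w(p:p+n-1);                 % characters p .. p+n-1
    end
    grams{k} = g;
end

% Pad shorter columns with '' so that every column has the same height.
maxRows = max(cellfun(@numel,grams));
out = repmat({''},maxRows + 1,NW);         % +1 for the header row of words
out(1,:) = words;
for k = 1:NW
    if ~isempty(grams{k})
        out(2:numel(grams{k}) + 1,k) = grams{k};
    end
end
```

The loops trade the vectorized indexing for readability. For short word lists the difference in speed is unlikely to matter, but for a large collection the `bsxfun` version may be preferable.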