Question

I have more than 10k text files that look like this. They all share a similar format but differ in size, sometimes larger and sometimes smaller.

[{u'language': u'english', u'area': 3825.8953168044045, u'class': u'machine printed', u'utf8_string': u'troia', u'image_id': 428035, u'box': [426.42422762784093, 225.33333055900806, 75.15151515151516, 50.909090909090864], u'legibility': u'legible', u'id': 1056659}, {u'language': u'na', u'area': 24201.285583103767, u'id': 1056660, u'image_id': 428035, u'box': [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606], u'legibility': u'illegible', u'class': u'machine printed'}]

I want to use a regular expression to extract two variables from each text file.

The output should look like this:

box  = [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606]
box1 = ... (sometimes there is more than one)

and the second output:

word = troia 
word1 = ... (sometimes there is more than one word)

My code 1: for extracting the words

fid = fopen('text1.txt','r');
C = textscan(fid, '%s','Delimiter','');
fclose(fid);

C = C{:};

Lia = ~cellfun(@isempty, strfind(C,'utf8_string'));

output = [C{find(Lia)}];
expression = 'u''utf8_string'': u+'
matchStr = regexp(output, expression,'match');

The result of my code 1 only gives me

utf8_string

My code 2: for extracting the box numbers

s = sprintf('text_.txt'); 
fid = fopen(s);
tline = fgetl(fid);

C = regexp(tline,'u''box'': +\[([0-9\. ,]+)\]','tokens');
C = cellfun(@(x) x{1},C,'UniformOutput',false)';
M = cell2mat(cellfun(@(x) x', cat(1,C2{:}),'UniformOutput',false));

Code 2 works, but not for every text file; for some of them I get this error:

Error using cat
Dimensions of matrices being concatenated are not consistent.

Answer 1

If you do not insist on regexp: the input string looks like JSON, so the following short code gives you even more than you asked for:

% Read the whole file
s = fileread('test.txt');
% Remove the odd u'
s = strrep(s, 'u''', '''');
% Replace ' by "
s = strrep(s, '''', '"');
% See http://www.mathworks.com/matlabcentral/fileexchange/20565
t = parse_json(s);
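One caveat about the first strrep above: it replaces every u followed by a quote, so it would also eat the last letter of a value that ends in u. A minimal sketch of a stricter replacement, assuming the u prefix is never preceded by a word character; the sample string here is made up for illustration and does not come from the question's files:

```matlab
% Hypothetical sample whose value ends in the letter u
raw = '[{u''utf8_string'': u''menu''}]';

% Plain strrep also removes the trailing u of "menu"
naive = strrep(raw, 'u''', '''');

% Negative lookbehind: only drop a u that does not follow a word character
safe = regexprep(raw, '(?<!\w)u''', '''');

disp(naive)
disp(safe)
```

Neither variant handles values that themselves contain quote characters, so this is only a partial safeguard.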

t is now a cell array containing structs with the data. So

word = t{1}.utf8_string;
box = cell2mat(t{1}.box);

will give you the first word and box. If you have a newer MATLAB version, you can use jsondecode instead of parse_json.
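To collect every word and box rather than only the first, something like the following might work. It is a sketch that assumes MATLAB R2016b or later (for jsondecode) and uses a shortened copy of the sample from the question in place of a real file; the variable names raw, js, words and boxes are illustrative.

```matlab
% Shortened sample in the same Python-style format as the question's files
raw = ['[{u''language'': u''english'', u''utf8_string'': u''troia'', ' ...
       'u''box'': [426.4, 225.3, 75.1, 50.9], u''id'': 1056659}, ' ...
       '{u''language'': u''na'', ' ...
       'u''box'': [224.0, 249.5, 172.1, 140.6], u''id'': 1056660}]'];

% Same clean-up as above: drop the u prefix, then use double quotes
js = strrep(strrep(raw, 'u''', ''''), '''', '"');
t = jsondecode(js);

% jsondecode returns a struct array when all entries share the same fields
% and a cell array of structs otherwise; normalize to a cell array
if isstruct(t)
    t = num2cell(t);
end

words = {};            % one entry per element that has utf8_string
boxes = zeros(0, 4);   % one row per element that has box
for k = 1:numel(t)
    e = t{k};
    if isfield(e, 'box')
        boxes(end+1, :) = e.box(:).';   % jsondecode yields a column vector
    end
    if isfield(e, 'utf8_string')
        words{end+1} = e.utf8_string;
    end
end
```

Checking isfield for each entry matters because illegible entries, like the second one in the sample, carry a box but no utf8_string.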
