Question

I have a large string (around 25M characters) in which I need to replace multiple substrings that follow a specific pattern.

Frame 1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Frame 2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Frame 7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

The substring I need to remove is 'Frame #', which occurs around 7670 times. I can pass multiple search strings to strrep using a cell array:

strrep(text,{'Frame 1','Frame 2',..,'Frame 7670'},';')

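However, this returns a cell array in which each cell holds the original string with only the corresponding substring from the input cell array replaced.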
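To make that behaviour concrete, here is a minimal sketch on a short made-up string (the sample text is an assumption, not the real data). Each output cell has only one of the search strings replaced, so no single cell contains the fully cleaned text:

```matlab
% Hypothetical short input, standing in for the 25M-character string
text = 'Frame 1 a Frame 2 b';

% Passing a cell array of search strings yields one output per search string
out = strrep(text, {'Frame 1','Frame 2'}, ';');

disp(class(out))   % cell
disp(out{1})       % only 'Frame 1' has been replaced
disp(out{2})       % only 'Frame 2' has been replaced
```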

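Is there a way to replace multiple substrings in a string other than using regexprep? I noticed that it is much slower than strrep, which is why I am trying to avoid it.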

With regexprep, it would be:

regexprep(text,'Frame \d*',';')

For a 25MB string, replacing all instances takes about 47 seconds.

Edit 1: Added the equivalent regexprep command

Edit 2: Added the size of the string for reference, the number of occurrences of the substring, and the regexprep execution time

Answer 1

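OK, in the end I found a way around the problem. Instead of using regexprep to change the substrings, I removed only the 'Frame ' substring (including the space, but not the number):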

rawData = strrep(text,'Frame ','');

This results in the following:

1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Then I changed all commas (,) and newlines (\n) to semicolons (;), again using strrep, and created one large vector containing all the numbers:

rawData = strrep(rawData,sprintf('\r\n'),';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,',',';');
rawData = textscan(rawData,'%f','Delimiter',';');

Then I removed the unneeded numbers (1, 2, ..., 7670), since they sit at known positions in the array (each frame contains a fixed number of values).

rawData{1}(firstInstance:spacing:lastInstance)=[];
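The answer does not say how firstInstance, spacing and lastInstance were chosen. The following is only a sketch of one plausible way to derive them, on a tiny made-up vector where each frame number is followed by a known, fixed count of data values (valuesPerFrame is an assumed name, not from the original):

```matlab
% Hypothetical layout: each frame is its frame number followed by 4 data values,
% wrapped in a cell to mimic what textscan returns
rawData = {[1 10 11 12 13 2 20 21 22 23 3 30 31 32 33].'};

valuesPerFrame = 4;                       % assumed to be known in advance
spacing        = valuesPerFrame + 1;      % frame number plus its data values
firstInstance  = 1;                       % position of the first frame number
lastInstance   = numel(rawData{1}) - valuesPerFrame;  % position of the last one

% Delete every frame number with a single strided index
rawData{1}(firstInstance:spacing:lastInstance) = [];
```

Because the deletion is one indexed assignment rather than a loop, its cost stays small next to the parsing step.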

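Then I carried on with my processing. The extra strrep calls and the removal of values from the array seem to be much faster than the equivalent regexprep. For the 25M-character string, the whole operation took about 47 seconds with regexprep, while this workaround needs only 5 seconds!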

Hope this helps.

Answer 2

Using regular expressions:

result = regexprep(text,'Frame [0-9]+','');

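Regular expressions can be avoided as follows. I use strrep with suitable replacement strings that act as a mask. The resulting strings all have the same length and are guaranteed to be aligned, so they can be combined into the final result using the mask. I have also included the ; you wanted. I don't know whether it will be faster than regexprep, but it is definitely more fun :-)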

% Data
text = 'Hello Frame 1 test string Frame 22 end of Frame 2 this'; %//example text
rep_orig = {'Frame 1','Frame 2','Frame 22'}; %//strings to be replaced.
%//May be of different lengths

% Computations    
rep_dest = cellfun(@(s) char(zeros(1,length(s))), rep_orig, 'uni', false);
%//series of char(0) of same length as strings to be replaced (to be used as mask)
aux = cell2mat(strrep(text,rep_orig.',rep_dest.'));
ind_keep = all(double(aux)); %//keep characters according to mask
ind_semicolon = diff(ind_keep)==1; %//where to insert ';' 
ind_keep = ind_keep | [ind_semicolon 0]; %// semicolons will also be kept
result = aux(1,:); %//for now
result(ind_semicolon) = ';'; %//include `;`
result = result(ind_keep); %//remove unwanted characters

With this example data:

>> text

text =

Hello Frame 1 test string Frame 22 end of Frame 2 this

>> result

result =

Hello ; test string ; end of ; this

Answer 3

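I think this can be done with textscan alone, which is known to be very fast. By specifying 'CommentStyle', the 'Frame #' lines are stripped out. This probably only works because each 'Frame #' entry sits on its own line. This code returns the raw data as one big vector: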

s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{:}

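You may want to know how many elements there are in each frame, or even reshape the data into a matrix. You can use textscan again (or before the call above) to get the data for the first frame only: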

f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
f1 = f1{:}
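As a rough sketch of the reshaping idea, assuming every frame holds the same number of values: the element count of the first frame can serve as the column height for reshape. The sample text, nPerFrame and frames are made up for illustration.

```matlab
% Hypothetical small input: 2 frames, each with 2 rows of 3 values
text = sprintf('Frame 1\n0,1,2\n3,4,5\nFrame 2\n6,7,8\n9,10,11\n');

% All values, with the 'Frame #' lines skipped as comments
s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{1};

% First frame only: parsing stops when 'Frame 2' cannot be read as a number
f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
nPerFrame = numel(f1{1});

% One column per frame (assumes all frames have nPerFrame values)
frames = reshape(s, nPerFrame, []);
```

If the frames turn out to have different lengths, reshape will raise an error, which is a useful early check on the data.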

In fact, if you only want the elements of the first row, you can use this:

l1 = textscan(text,'%f,','CommentStyle','Frame 1')
l1 = l1{:}

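Another nice thing about textscan is that you can use it to read the file directly (it looks like you may currently be using some other method), needing only fopen to get a FID. That way, the string data text does not have to be held in memory.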
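A minimal sketch of reading from a file identifier instead of a string. To keep it self-contained it first writes a small temporary sample file; with real data you would skip that part and pass your own file name to fopen.

```matlab
% Write a small hypothetical sample file so the sketch is self-contained
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'Frame 1\n0,1,2\n3,4,5\nFrame 2\n6,7,8\n9,10,11\n');
fclose(fid);

% Read the numbers straight from the file, skipping the 'Frame #' lines
fid = fopen(fname,'r');
s = textscan(fid,'%f','CommentStyle','Frame','Delimiter',',');
fclose(fid);
s = s{1};

delete(fname);  % remove the temporary sample file
```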

Replace multiple substrings using strrep in Matlab
