Splitting text into sentences in MATLAB

Time: 2014-03-15 17:35:15

Tags: regex matlab

I have a text file:

The annual festival. Of every man is the fund which originally.

Supplies it with all the necessaries? And conveniences of birth which

it annually forgone! And which consist always either in the immediate

produce of that action, or in what is wasted with that produce from

other nations.

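I need to split it into sentences. This is a simplified version, but you can assume that every sentence ends with one of . ? ! and that the punctuation mark is followed by a space and a capital letter.

I have tried various approaches using the strsplit function. The one below comes close but is still wrong.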

strsplit(textfile2,{'. ','! ','? '}) % doesn't work fully

textfile2 = 

'The annual festival'    [1x80 char]    [1x53 char]    [1x133 char]

I would like my output to be a cell array of strings, like:

The annual festival
Of every man is the fund which originally
Supplies it with all the necessaries
And conveniences of birth which it annually forgone
And which consist always either in the immediate produce of that action, or in what is wasted with that produce from other nations

That is, each sentence without its ending punctuation. Any ideas?

2 answers:

Answer 0: (score: 2)

This can be done with regexp in MATLAB.

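% Sample text from the question as a single character vector
text = 'The annual festival. Of every man is the fund which originally. Supplies it with all the necessaries? And conveniences of birth which it annually forgone! And which consist always either in the immediate produce of that action, or in what is wasted with that produce from other nations.';
% Split on . ? ! plus any following whitespace (no comma, so the last sentence stays whole)
SplitString = regexp(text, '[.?!]\s*', 'split');
% The text ends with a period, so the last piece is empty; drop it
SplitString = SplitString(~cellfun('isempty', SplitString));

for it = 1:length(SplitString)
    disp(SplitString{it}); % curly braces give the char array, not a 1x1 cell
end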
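
The `[1x80 char]` cell in the question's output is consistent with a line break, rather than a space, following "originally." in the file, so `'. '` never matched there. A minimal sketch of one way to handle that, assuming the text was read into a single character vector (the shortened `raw` sample and the variable names are illustrative, not from the original post):

```matlab
% Hypothetical sample: three lines joined by newlines, as fileread might return them
raw = sprintf('The annual festival. Of every man is the fund which originally.\nSupplies it with all the necessaries? And conveniences of birth which\nit annually forgone!');

% Collapse every run of whitespace (spaces, tabs, line breaks) into one space
flat = strtrim(regexprep(raw, '\s+', ' '));

% Split on a sentence terminator followed by optional whitespace
parts = regexp(flat, '[.?!]\s*', 'split');

% The text ends with a terminator, so the final piece is empty; drop it
parts = parts(~cellfun('isempty', parts));

% Show one sentence per line
for k = 1:numel(parts)
    disp(parts{k});
end
```

Flattening the whitespace first means a sentence that wraps across two lines of the file comes back as one piece with a single space at the join.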

Answer 1: (score: 2)

Use curly braces to access the character arrays in the cell array returned by strsplit:
x{1}
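
To illustrate the difference between the two kinds of indexing, here is a small sketch with a made-up input string (`x` stands in for whatever strsplit returned):

```matlab
% Hypothetical input; strsplit treats the delimiter '. ' literally, not as a regex
x = strsplit('one. two. three', '. ');

a = x(1);   % parentheses return a 1x1 cell that still wraps the text
b = x{1};   % curly braces return the character array itself

disp(class(a));
disp(class(b));
```

Parentheses are the right choice when selecting a subset of cells, and curly braces when the text itself is needed, for example to pass to disp or strcmp.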

If you want to keep the punctuation at the end of each sentence:

sentences = regexp(textarray,'\S.*?[\.\!\?]','match')
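
The question also states that each terminator is followed by a space and a capital letter. A hedged sketch that uses that rule directly through lookaround assertions, so a period inside a token such as "2.5" is not treated as a sentence end (the sample text and the names `txt` and `bare` are assumptions for illustration):

```matlab
% Hypothetical sample text; "2.5" contains a period that does not end a sentence
txt = 'Version 2.5 shipped. It works! Is it fast?';

% Split only on whitespace that follows . ? or ! and precedes a capital letter.
% The lookbehind and lookahead are zero-width, so the punctuation stays attached.
sentences = regexp(txt, '(?<=[.?!])\s+(?=[A-Z])', 'split');

% Optionally strip the final terminator from each sentence
bare = regexprep(sentences, '[.?!]$', '');
```

This is stricter than matching any terminator, so it only helps when the input really follows the capital-letter convention.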

The correct way to split, without trailing punctuation and keeping the last sentence:

sentences = regexp(text,'[\.\!\?]\s*','split')

Quick check of the output: char(sentences)
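
As a small illustration of what that check produces, char turns a cell array of character vectors into a character matrix with one row per cell, padding shorter rows with spaces. A sketch with a made-up two-element cell array:

```matlab
% Hypothetical cell array standing in for the split result
sentences = {'The annual festival', 'Of every man'};

% One row per sentence; shorter rows are padded with trailing spaces
M = char(sentences);

disp(M);
disp(size(M));
```

Because of the padding, the matrix is convenient for viewing but the cell array remains the better form for further processing.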