Question

I have a very long string in a Perl variable; the string contains more than 500 words.

$mytext = "This text goes on and on and on........";

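Basically, this lengthy string can contain anything and everything, including all kinds of special characters. It can include apostrophes (like - it's a division of cleo's business), numbers (like - incorporated on August 2, 2001), commas, semicolons and more apostrophes (like - through its different divisions, the business's earnings), and other special characters (like '&', single and double quotes).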

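My requirement is to extract a specific number of words (not characters) from the beginning of the string. For example, I may need to pick the first 200 words. I know there is a built-in substr function: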

substr($mytext, $start, $length)

But that extracts a number of characters.
How can I extract a number of words instead?

Answer 1

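You can do this with the split function:

It accepts a regular expression: here \W+ splits the string every time a non-word character (or a sequence of such characters) is encountered.
It takes an optional limit on how many pieces the string is cut into, which effectively caps the number of fields in the output.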

Code:

my $mytext = "This text goes on and on and on........";
my $nb_words = 20;
my @words = split(/\W+/, $mytext, $nb_words + 1);
pop @words; # the last item holds the unsplit remainder of the string
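
One caveat with the snippet above: the unconditional pop assumes the text has more than $nb_words words, and split returns an empty first field when the text starts with a non-word character. Below is a minimal sketch of a more defensive variant, assuming the same \W+ definition of a separator. The helper name first_words is hypothetical, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $n leading words of $text,
# where a "word" is a run of \w characters (same definition as the split above).
sub first_words {
    my ($text, $n) = @_;
    return () if $n < 1;
    # Drop leading separators so split does not produce an empty first field.
    $text =~ s/^\W+//;
    my @words = split /\W+/, $text, $n + 1;
    # Only when split reached its limit is the last field the unsplit remainder.
    pop @words if @words > $n;
    # A trailing separator leaves an empty last field when the limit was not reached.
    pop @words if @words && $words[-1] eq '';
    return @words;
}

my @first = first_words("This text goes on and on and on........", 3);
print "@first\n";   # This text goes
```

Note that \W+ also splits on apostrophes, so "cleo's" comes back as "cleo" and "s"; whether that is acceptable depends on what should count as a word.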

Answer 2

If you need the part of the text that holds the first N words, with all its whitespace, punctuation, etc.:

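use feature 'say';

my $text = q(one two, three-four five etc);
my $n = 4;

# Capture $n repetitions of "a word, then as little as possible up to the next word".
my ($subtext) = $text =~ /((?:\w+.*?){$n})/;
say $subtext;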

With the text string above, this prints

one two, three-four

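Adjust what you consider a "word" in the regex. For example, if hyphens are allowed inside a word, change \w+ to [\w-]+ (in which case three-four is a single "word", so five is included in the result as well).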
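
To make the effect of the word definition concrete, here is a small side-by-side sketch using the same sample text. The variable names $plain and $hyphen are only illustrative.

```perl
use strict;
use warnings;

my $text = q(one two, three-four five etc);
my $n    = 4;

# Default: \w+ treats "three" and "four" as two separate words.
my ($plain) = $text =~ /((?:\w+.*?){$n})/;

# Hyphen-aware: [\w-]+ treats "three-four" as one word, so "five" is the fourth.
my ($hyphen) = $text =~ /((?:[\w-]+.*?){$n})/;

print "$plain\n";    # one two, three-four
print "$hyphen\n";   # one two, three-four five
```

Since . does not match a newline by default, text that spans several lines would need the /s modifier on these patterns.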

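If you need a list of words instead, then apart from the split shown in the other answer you can also "tokenize" the string with a regex: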

my $n = 4;
my @words;

push @words, $1 while $text =~ /(\w+)/g and @words < $n;
say "@words";

which prints

one two three four

If your "words" are not just letters, digits and underscores, then again change what is used in place of \w.
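
As an illustration of swapping out \w, the sketch below tokenizes with a word class that also accepts apostrophes and hyphens. That class is an assumption chosen to fit the sample sentence from the question, not something the answer prescribes.

```perl
use strict;
use warnings;

my $text = q(it's a division of cleo's business);
my $n    = 4;

# Assumed word definition: letters, digits, underscores, apostrophes, hyphens.
my $word = qr/[\w'-]+/;

my @words;
while ($text =~ /($word)/g) {
    push @words, $1;
    last if @words >= $n;   # stop scanning once we have enough words
}
print "@words\n";   # it's a division of
```

Keeping the word pattern in a qr// variable means the definition of a "word" lives in one place and can be changed without touching the loop.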

Answer 3

If a word can be defined as any run of non-whitespace characters, you can do this:

use feature 'say';

my $str = <<'EOD';
Basically, this lengthy string can contain anything and everything including all kinds of special characters. It can include special characters (like apostrophes - it's a division of cleo's business), numbers (like - incorporated on August 2, 2001), commas, semicolons and apostrophe's (like - through its different divisions, the business's earnings), special characters (like '&', single and double quotes)
EOD

my ($wd) = $str =~ /((?:\S+\s+){1,30})/; # limited to 30 words for testing
say $wd;

Output:

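Basically, this lengthy string can contain anything and everything including all kinds of special characters. It can include special characters (like apostrophes - it's a division of cleo's business), numbers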
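
The pattern above requires whitespace after every word, so it leaves trailing whitespace in the capture and never counts a final word that is not followed by whitespace. A minimal sketch of a variant that avoids both, with the word count passed in as a parameter; the helper name leading_words is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: leading substring holding at most $n whitespace-delimited words.
sub leading_words {
    my ($str, $n) = @_;
    return '' if $n < 1;
    my $more = $n - 1;   # words allowed after the first one
    my ($chunk) = $str =~ /^\s*(\S+(?:\s+\S+){0,$more})/;
    return defined $chunk ? $chunk : '';
}

print leading_words("one two, three-four five etc", 3), "\n";   # one two, three-four
print leading_words("only two", 5), "\n";                        # only two
```

Here punctuation stays attached to its word, so "two," counts as one word, just as in the \S+ approach above.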

Perl - capture a specific number of words from a string
