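I have written a Perl script that uses Win32::OLE to read the contents of a Microsoft Word document.
My problem is with documents that contain numbered lists (starting with 1, 2, 3, ...). My Perl script cannot get the numbers: I only get the text content, not the numbering.
Please suggest how to convert a numbered list to plain text so that both the numbering and the text are preserved.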
Answer 0 (score: 7)
My blog post "Extract bullet lists from PowerPoint slides using Perl and Win32::OLE" shows how to do this with PowerPoint. It turns out the task is a little simpler for Word.
#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';

use Carp qw( croak );
use Const::Fast;
use Path::Class;
use Try::Tiny;

use Win32::OLE;
use Win32::OLE::Const ('Microsoft.Word');
use Win32::OLE::Enum;
$Win32::OLE::Warn = 3;

run(@ARGV);

sub run {
    my $docfile = shift;
    # Croaks if it cannot resolve
    $docfile = file($docfile)->absolute->resolve;

    my $word = get_word();
    my $doc = $word->Documents->Open(
        {
            FileName           => "$docfile",
            ConfirmConversions => 0,
            AddToRecentFiles   => 0,
            Revert             => 0,
            ReadOnly           => 1,
        }
    );

    my $pars = Win32::OLE::Enum->new($doc->Paragraphs);

    while (my $par = $pars->Next) {
        print_paragraph($par);
    }
}

sub print_paragraph {
    my $par = shift;
    my $range = $par->Range;
    my $fmt = $range->ListFormat;
    my $bullet = $fmt->ListString;
    my $text = $range->Text;

    unless ($bullet) {
        say $text;
        return;
    }

    my $level = $fmt->ListLevelNumber;
    say ">" x $level, join(' ', $bullet, $text);

    return;
}

sub get_word {
    my $word;

    try { $word = Win32::OLE->GetActiveObject('Word.Application') }
    catch { croak $_ };

    return $word if $word;

    $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit });
    return $word if $word;

    croak sprintf('Cannot start Word: %s', Win32::OLE->LastError);
}
Given the following Word document:
it produces the output:
This is a document
>1. This is a numbered list
>2. Second item in the numbered list
>3. Third one
Back to normal paragraph.
>>a. Another list
>>b. Yup, here comes the second item
>>c. Not so sure what to put here
>>>i. Sub-item
The Object Browser is indispensable.
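Since the goal is plain text that keeps both the numbering and the wording, it may help to separate the formatting step from the OLE calls. The sketch below is one possible way to do that, not part of the script above: `format_paragraph` is a hypothetical helper that takes the values `print_paragraph` already reads (`ListString`, `ListLevelNumber`, and `Range->Text`) and returns one line of output. It assumes that Word ends each paragraph's text with a paragraph mark ("\r"), which it strips, and it tests the list string by length rather than truthiness so that a bare "0" label would not be mistaken for "no list".

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): builds one plain-text
# line from the values that print_paragraph() obtains through OLE.
sub format_paragraph {
    my ($bullet, $level, $text) = @_;

    # Assumption: the paragraph text ends with Word's paragraph mark ("\r").
    # Strip it, plus any stray line feed, so the caller controls line endings.
    $text =~ s/[\r\n]+\z//;

    # Ordinary paragraphs have an empty list string: return the text as-is.
    return $text unless defined $bullet && length $bullet;

    # List paragraphs: one '>' per nesting level, then the label and the text.
    return ('>' x $level) . "$bullet $text";
}

# Example calls with hand-written values standing in for OLE results.
print format_paragraph('1.', 1, "This is a numbered list\r"), "\n";
print format_paragraph('',   0, "Back to normal paragraph.\r"), "\n";
```

Inside `print_paragraph`, the two `say` calls could then be replaced by a single `say format_paragraph($bullet, $level, $text);`, with `$level` read only when the list string is non-empty.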