Parsing multi-line quoted strings in a large file

Time: 2019-05-13 23:25:33

Tags: regex perl regex-negation

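I am working on a web publishing localization project, starting from a mature website written in English and published by a CMS. The translation file contains headings that identify pages, subheadings that identify the sections of each page, and pairs of strings that give the original phrase from the English website and the translated phrase in another language.

Each translation file contains only one language. For a Spanish translation, a representative excerpt of the file looks like this: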

## 3602 Example Page

    ### Title

        'Example Page' => 'Página de ejemplo',

    ### Body

        'This is an example of a string that came from an example page.' => 'Este es un ejemplo de una cadena que proviene de una página de ejemplo.',
        'Parsing this would be relatively simple, except that
occasionally, 
there are carriage returns thrown into the text without warning.' => 'Parsear esto sería relativamente simple, excepto que
ocasionalmente, 
hay retornos de carro lanzados en el texto sin previo aviso.',

    ### Extended


## 3704 About Us

    ### Title

        'About Us' => 'Sobre nosotros',

    ### Body

        'This text takes the place of text which would identify the client.' => 'Este texto toma el lugar del texto que identificaría al cliente.',
        q{I passed the English text though Google Translate. Don't think for a moment that these passages are professionally translated!} => q{Pasé el texto en inglés a través de Google Translate. ¡No piense por un momento que estos pasajes son traducidos profesionalmente!},

    ### Extended


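What I want to do is write a Perl script that parses this file, finds the page in the CMS, replaces the original English strings with the translated strings, and then saves the page in the CMS for subsequent publishing.

The CMS I am working with has a Perl API, so the whole script is written in Perl.

My approach so far has been to read the file one line at a time and use regular expressions to identify the significant parts of the file.

The key part of this code looks like this: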

    while (defined($current_line = <FILE>))
    {
        chomp $current_line;
        $total_lines++;

        ##########
        #
        # We need to parse the file, line-by-line, to determine what each line represents.
        #
        # If the $current_phrase is populated at the beginning of the case statement,
        # we know that the 
        #
        # When we start parsing, $current_page_id is zero (0). If we hit a page selector and
        # the page ID is something other than zero, we need to save the previous page.
        #
        ##########  

        if (length($current_phrase) > 0) {
            if ($current_line =~ /(.*\')\s=>\'(.*)/) {
                $current_phrase .= '\n' . $1;
            }
        }

        elsif ($current_line =~ qr/##\s(\d+)\s.+/mp) {

            ##########
            #
            # $1 is the page ID number.
            #
            ##########

            if ($current_page_id != int($1)) {
                print "\nPage $1 selector\n";
                $current_page_id = int($1);
                $current_page_change_count = 0;
                $current_page_section_name = '';
                $current_page_section_content = '';
                $current_phrase = '';

            }



        } elsif ($current_line =~ qr/###\s(.+)/mp) {

            ##########
            #
            # $1 is the name of the page section.
            #
            # We have to figure out if the page section is the same as the one that we
            # have been processing.
            #
            ##########

            print "\nPage Section Delimiter: " . $1 . "\n";

            if ($1 ne $current_page_section_name) {

                ##########
                #
                # Since $1 is not $current_page_section_name, we need to put
                # $current_page_section_content into the page section where it belongs.
                # 
                # $current_page_section_name refers to the section of the page with changes.
                #
                ##########

                $current_page_section_name = $1;

            }

        } elsif (($current_line =~ qr/'((?:(?>[^'\\]*)|\\.)*)' => '((?:(?>[^'\\]*)|\\.)*)',/mp) || ($current_line =~ qr/q\{((?:(?>[^}\\]*)|\\.*))} => q\{((?:(?>[^}\\]*)|\\.*))},/mp)){

                ##########
                #
                # The complex regular expression above is intended to capture multi-line
                # variants of either the 'phrase' or q{phrase} pattern.
                # 
                # See https://stackoverflow.com/questions/23086883/perl-multiline-string-regex
                # for some idea how the single quote pattern was found. We had to work up the
                # q{phrase} pattern ourselves.
                #
                #
                ##########          

            $current_page_change_count++;
            $total_change_count++;
            print "Phrase " . $current_page_change_count . ", original: " . $1 . ", change to: " . $2 . "\n\n";

        } elsif (($current_line =~ qr/^\s+?\'(.+)[^\'],?\s?/mp) || ($current_line =~ qr/^\s+?q\{(.+)[^}],?\s?/mp)) {

                ##########
                #
                # The biggest unresolved issue with the while loop is how
                # to identify the unterminated strings that begin with
                # a single quote or the q{ construct.
                #
                # The regular expression above is an attempt to match both cases.
                #
                # Eventually, I will have to search for the end of the
                # string, the => construct, and the translated phrase.
                #
                ##########  

            print "Unterminated string: " . $current_line . "\n";
        } elsif (($current_line =~ qr/^\s+/mp) || (length($current_line) == 0)) {
            print "Blank line.\n";
            $total_blank_lines++;
        } else {
            #
            # Want to ignore, not print this.
            print "Something else:  \'" . $current_line . "\'\n";
            #
            $total_blank_lines++;
        }


    }

    print "\nTotal lines: " . $total_lines . "\n";
    print "\nTotal blank lines: " . $total_blank_lines . "\n";
    print "Total change count: " . $total_change_count . "\n";

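As I said in the code comments, the biggest problem I am having is crafting a regular expression that identifies an unterminated string. By that I mean a phrase from the English version of the website that begins with a single quote or the q{ construct and has a carriage return somewhere within the line of text.

The current regular expression is not selective enough by itself, but it does the job because the preceding regular expressions correctly pick out the other parts of the file.

Where I am looking for help is:

  1. Making sure this regular expression is selective enough.
  2. Figuring out how to accumulate all of the text that should become part of $current_phrase, so that the phrase can span multiple lines.
  3. Finding a way forward, so that I can develop the other regular expressions needed to recognize the remaining fragments of a multi-line translation pair in a file of this nature.

How can I solve this?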

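1 answer:

Answer 0 (score: 2)

Your input has Perl-style # comments, Perl-style fat commas (to associate the English and foreign text), and even the Perl q{} construct. It seems like you really want to parse this file with Perl itself. If that is the case (and if you can always trust that your input has not been maliciously tampered with), you could try something like this: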

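    # $INPUT holds the whole file as one string. The capturing group makes
    # split return the "#" heading lines as well as the text between them.
    my @sections = split /^(\s*#[^\n]*)/m, $INPUT;
    my ($page_number, $page_section);
    foreach my $section (@sections) {
        next unless $section =~ /\S/;
        if ($section =~ /^\s*##\s(\d+)\s.+/) {
            $page_number = $1;
        } elsif ($section =~ /^\s*###\s(.+)/) {
            $page_section = $1;
        } elsif ($section =~ /=>/) {
            # Only safe on trusted input: this runs the section as Perl code.
            my %phrases = eval( "($section)" );
            warn "Could not parse section: $@" if $@;
            # manipulate keys and values of %phrases
        }
    }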
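
If running eval over the file feels too risky, the same whole-file idea can be done with a regex scanner instead. This is a different technique from the eval approach above, and only a minimal sketch. It assumes the file has already been read into a single string, that phrases are delimited only by '...' or q{...}, and that q{...} strings contain no nested or escaped braces. The name parse_translations is a hypothetical helper, not part of any CMS API.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any CMS API): scan a slurped translation
# file and return one hash reference per phrase pair.
sub parse_translations {
    my ($input) = @_;

    # One quoted string, either '...' (with backslash escapes) or q{...}.
    # Assumption: q{...} strings contain no nested or escaped braces.
    # Negated character classes match newlines, so a phrase may span lines.
    my $str = qr/ ' ( (?: [^'\\]++ | \\. )*+ ) ' | q\{ ( [^}]*+ ) \} /xs;

    my ($page, $section, @pairs);
    pos($input) = 0;
    while (pos($input) < length $input) {
        if ($input =~ /\G\s+/gc) {
            next;                                   # whitespace between tokens
        } elsif ($input =~ /\G###[ \t]+([^\n]+)/gc) {
            $section = $1;                          # e.g. "### Body"
        } elsif ($input =~ /\G##[ \t]+(\d+)[^\n]*/gc) {
            ($page, $section) = ($1, undef);        # e.g. "## 3602 Example Page"
        } elsif ($input =~ /\G$str\s*=>\s*$str\s*,?/gc) {
            # Groups 1 and 3 are set for '...', groups 2 and 4 for q{...}.
            # Escapes such as \' are matched but left undecoded.
            push @pairs, {
                page => $page, section => $section,
                from => $1 // $2, to => $3 // $4,
            };
        } else {
            $input =~ /\G[^\n]*\n?/gc;              # skip an unrecognized line
        }
    }
    return @pairs;
}
```

Because the scanner works on the whole string with \G and /gc rather than line by line, there is no need to detect "unterminated" strings or to accumulate $current_phrase by hand: a phrase with embedded line breaks is matched in one step. The input string can be obtained with something like `my $input = do { local $/; <$fh> };` on an opened filehandle.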

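If this is not the direction you want to go, I think you would be happier rewriting your input in a standard format that has mature, battle-tested parsers, such as JSON.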

    {"source":"en-US", "dest":"es-ES",
     "pages":[{"pageTitle":"Example Page", "pageNumber":3602,
      "sections":[{"sectionName":"Title", "phrases":{
       "Example Page":"Página de ejemplo"}},
       {"sectionName":"Body","phrases":{
        "This is an example of a string that came from an example page.":
        "Este es un ejemplo de una cadena que proviene de una página de ejemplo.",
        ... }}]}]}
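
Reading such a file back in Perl then takes only a few lines. The sketch below uses the core JSON::PP module and assumes the list of pages is stored under a key named "pages" (a name chosen here for illustration). The helper flatten_phrases is likewise hypothetical.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Assumed layout: a top-level object with "source", "dest", and a "pages"
# array (the key name "pages" is an assumption made for this sketch).
my $json = <<'END_JSON';
{"source":"en-US", "dest":"es-ES",
 "pages":[{"pageTitle":"Example Page", "pageNumber":3602,
  "sections":[{"sectionName":"Title", "phrases":{
   "Example Page":"P\u00e1gina de ejemplo"}},
  {"sectionName":"Body", "phrases":{
   "Line one\nline two":"Linea uno\nlinea dos"}}]}]}
END_JSON

# Hypothetical helper: flatten the decoded structure into rows of
# [page number, section name, English phrase, translated phrase].
sub flatten_phrases {
    my ($doc) = @_;
    my @rows;
    for my $page (@{ $doc->{pages} }) {
        for my $section (@{ $page->{sections} }) {
            my $phrases = $section->{phrases};
            for my $from (sort keys %$phrases) {
                push @rows, [ $page->{pageNumber}, $section->{sectionName},
                              $from, $phrases->{$from} ];
            }
        }
    }
    return @rows;
}

my @rows = flatten_phrases(decode_json($json));
printf "%d phrase pairs\n", scalar @rows;
```

Line breaks inside a phrase become \n escapes in JSON, so the multi-line problem disappears at the format level. Note that JSON object keys carry no guaranteed order. If the order of phrases within a section matters, an array of two-element arrays may be a better fit than an object.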