在分隔符之间随机化文本

时间:2015-12-24 13:02:12

标签: perl shell text-processing text-parsing

我有这个简单的输入

I have {red;green;orange} fruit and cup of {tea;coffee;juice}

我使用Perl识别两个外部大括号分隔符{}之间的模式,并使用内部分隔符;随机化内部字段。

我收到了这个输出

I have green fruit and cup of coffee

这是我的工作Perl脚本

perl -plE 's!\{(.*?)\}!@x=split/;/,$1;$x[rand@x]!ge' <<< 'I have {red;green;orange} fruit and cup of {tea;coffee;juice}'

我的任务是处理此输入格式

I have { {red;green;orange} fruit ; cup of {tea;coffee;juice} } and {nice;fresh} {sandwich;burger}.

根据我的理解,脚本应跳过第一个文本部分中的外部闭括号{ ... },其中包含带开括号和右括号的文本:

{ {red;green;orange} fruit ; cup of {tea;coffee;juice} }

应该选择一个随机的部分,比如

{red;green;orange} fruit

cup of {tea;coffee;juice}

然后它会更深入:

green fruit

处理完所有文本后,结果可能是以下任何一种

I have red fruit and fresh burger.
I have cup of tea and nice sandwich
I have green fruit and nice burger.
I have cup of coffee and fresh burger.

脚本也应解析并随机化下一个文本。例如

This {beautiful;perfect} {image;photography}, captured with the { {NASA;ESA} Hubble Telescope ; {NASA;ESA} Hubble Space Telescope} }, is the {largest;sharpest} image ever taken of the Andromeda galaxy { {— otherwise known as M31;— known as M31}; [empty here] }.
This is a cropped version of the full image and has 1.5 billion pixels. { You would need more than {600;700;800} HD television screens to display the whole image. ; If you want to display the whole image, you need to download more than {1;2} Tb. traffic and use 800 HD displays }

示例输出可以是

This beautiful image, captured with the NASA Hubble Telescope, is the
sharpest image ever taken of the Andromeda galaxy — otherwise known as
M31.
This is a cropped version of the full image and has 1.5 billion
pixels. You would need more than 700 HD television screens to display
the whole image.

3 个答案:

答案 0 :(得分:3)

非贪婪是一个很好的想法,但并不是很有效。你可以添加一个循环:

perl -plE 'while(s!\{([^{}]*)\}!@x=split/;/,$1;$x[rand@x]!ge){}'

请注意,您的示例输入具有无法匹配的大括号,因此这似乎会输出虚假的&#39;}&#39;

答案 1 :(得分:2)

很好的挑战。你需要做的是找到一套没有内部括号的大括号,然后从中选择一个随机项。你需要在全球范围内这样做。这将取代只有“1级”的括号。您需要循环遍历字符串,直到找不到更多匹配项。

use v5.18;
use strict;
use warnings;

sub rand_sentence {
    my $copy = shift;
    1 while $copy =~ s{ \{ ([^{}]+) \} } 
                      { my @words = split /;/, $1; $words[rand @words] }xsge;
    return $copy;
}

my $str = 'I have { {red;green;orange} fruit ; cup of {tea;coffee;juice} } and {nice;fresh} {sandwich;burger}.';
say rand_sentence($str);
say '';

$str = <<'END';
This {beautiful;perfect} {image;photography}, captured with the { {NASA;ESA}
Hubble Telescope ; {NASA;ESA} Hubble Space Telescope }, is the
{largest;sharpest} image ever taken of the Andromeda galaxy { {— otherwise
known as M31;— known as M31}; [empty here] }. This is a cropped version of the
full image and has 1.5 billion pixels. { You would need more than {600;700;800}
HD television screens to display the whole image. ; If you want to display the
whole image, you need to download more than {1;2} Tb.  traffic and use 800 HD
displays }
END

say rand_sentence($str);

示例输出

I have  orange fruit  and fresh sandwich.

This beautiful photography, captured with the  ESA Hubble Space Telescope , is the
largest image ever taken of the Andromeda galaxy  — otherwise
known as M31. This is a cropped version of the
full image and has 1.5 billion pixels.  If you want to display the
whole image, you need to download more than 1 Tb.  traffic and use 800 HD
displays

答案 2 :(得分:0)

TXR解决方案。有很多方法可以解决这个问题。

假设我们正在从标准输入中读取数据。我们如何读取记录中的数据,这些记录不是由通常的换行符分隔的,而是括号选择模式?我们通过在标准输入流上创建记录适配器对象来完成此操作。 record-adapter函数的第三个参数是一个布尔值,表示我们要保留终止分隔符(与记录分隔正则表达式匹配的部分)。

因此,如果数据看起来像foo bar {bra;ces} xyzzy {a;b;c} d\n,则会变成这些记录:foo bar {bra;ces}xyzzy {a;b;c}d\n

然后,我们使用提取语言处理这些记录,就好像它们是文本行一样。它们分为两种模式:以支撑图案结束的线条和不具有线条的线条。后者只是回应。前者按随机支架替换的要求进行处理。

我们还初始化*random-state*,以便在每次运行时播种PRNG以产生不同的伪随机序列。如果make-random-state没有给出参数,它会创建一个从系统参数初始化的随机状态对象,如进程ID和系统时间:

@(do (set *random-state* (make-random-state)))
@(next @(record-adapter #/{[\w;]+}/ *stdin* t))
@(repeat)
@  (cases)
@*text{@switch}
@    (do (put-string `@text@(first (shuffle (split-str switch ";")))`))
@  (or)
@text
@    (do (put-string text))
@  (end)
@(end)

试运行:

$ cat data
I have {red;green;orange} fruit and cup of {tea;coffee;juice}.
$ txr rndchoose.txr < data
I have red fruit and cup of tea.
$ txr rndchoose.txr < data
I have orange fruit and cup of tea.
$ txr rndchoose.txr < data
I have green fruit and cup of coffee.