PHP - 将两个TEXT文件与条件组合在一起

时间:2011-11-26 12:20:37

标签: php arrays text-files

(对于长期问题提前抱歉 - 问题实际上很简单 - 但要解释它可能不是那么简单)

我在PHP中的诺言技巧受到以下挑战:

输入2个TXT文件,其结构如下:

$rowidentifier //number,letter,string etc..
$some semi-fixed-string $somedelimiter $semi-fixed-string
$content //with unknown length or strings or lines number.

阅读上面的内容,我的意思是“半固定字符串,意味着它是一个具有KNOWN结构的字符串,但是UNKNOWN内容..

举一个实际的例子,让我们拿一个SRT文件(我只是用它作为豚鼠,因为结构与我需要的非常相似):

1
00:00:12,759 --> 00:00:17,458
"some content here "
that continues here

2
00:00:18,298 --> 00:00:20,926
here we go again...

3
00:00:21,368 --> 00:00:24,565
...and this can go forever...

4
.
.
.

我想要做的是从一个文件中取出$ content部分,并将其放在第二个文件的正确位置。

回到示例SRT,有:

//file1 

    1
    00:00:12,759 --> 00:00:17,458
    "this is the italian content "
    which continues in italian here

    2
    00:00:18,298 --> 00:00:20,926
    here we go talking italian again ...

//file2 

    1
    00:00:12,756 --> 00:00:17,433
    "this is the spanish, chinese, or any content "
    which continues in spanish, or chinese here

    2
    00:00:16,293 --> 00:00:20,96
    here we go talking spanish, chinese or german again ...

将导致

//file3 

        1
        00:00:12,756 --> 00:00:17,433
        "this is the italian content "
        which continues in italian here
        "this is the spanish, chinese, or any content "
        which continues in spanish, or chinese here

        2
        00:00:16,293 --> 00:00:20,96
        here we go talking italian again ...
        here we go talking spanish, chinese or german again ...

或更多php喜欢:

$rowidentifier //unchanged
$some semi-fixed-string $somedelimiter $semi-fixed-string //unchanged, except maybe an option to choose if to keep file1 or file2 ...
$content //from file 1
$content //from file 2

所以,在所有这些介绍之后 - 这就是我所拥有的(实际上没有任何东西......)

$first_file = file('file1.txt'); // no need to comment right ?
$second_file = file('file2.txt'); // see above comment
$result_array = array(); /construct array
foreach($first_file as $key=>$value) //loop array and.... 
$result_array[]= trim($value).'/r'.trim($second_file[$key]); //..here is my problem ...

// $Value is $content - but LINE BY LINE , and in our case, it could be 2-3- or even 4 lines
// should i go by delimiters /n/r ??  (not a good idea - how can i know they are there ?? )
// or should i go for regex to lookup for string patterns ? that is insane , no ?

$fp = fopen('merge.txt', 'w+'); fwrite($fp, join("\r\n", $result_array); fclose($fp);

这将逐行进行 - 这不是我需要的。我需要条件.. 另外 - 我确信这不是一个聪明的代码,或者有更好的方法可以实现它 - 所以任何帮助都会受到赞赏......

1 个答案:

答案 0 :(得分:3)

您实际想要做的是并行迭代这两个文件,然后合并属于彼此的部分。

但你不能使用行号,因为那些可能不同。所以你需要使用条目号(块)。所以你需要给它一个“数字”或更精确,以便从文件中逐个输出一个条目。

因此,您需要一个能够将某些行转换为块的数据的迭代器。

所以而不是:

foreach($first_file as $number => $line)

它是

foreach($first_file_blocks as $number => $block)

这可以通过编写自己的迭代器来完成,该迭代器将文件的行作为输入,然后将行转换为块。为此你需要解析数据,这是一个基于状态的解析器的一个小例子,可以将行转换为块:

$state = 0;
$blocks = array();
foreach($lines as $line)
{
    switch($state)
    {
        case 0:
            unset($block);
            $block = array();
            $blocks[] = &$block;
            $block['number'] = $line;
            $state = 1;
            break;
        case 1:
            $block['range'] = $line;
            $state = 2;
            break;
        case 2:
            $block['text'] = '';
            $state = 3;
            # fall-through intended
        case 3:
            if ($line === '') {
                $state = 0;
                break;
            }
            $block['text'] .= ($block['text'] ? "\n" : '') . $line;
            break;
        default:
            throw new Exception(sprintf('Unhandled %d.', $state));
    }
}
unset($block);

它只是沿着线条运行并改变它的状态。基于该状态,每一行都作为其块的一部分进行处理。如果新块开始,则将创建它。它适用于您在问题中概述的SRT文件demo

为了更灵活地使用它,将其转换为迭代器,它在构造函数中使用$lines并在迭代时提供块。这需要一些很少的采用,解析器如何获取行,但它通常是相同的。

class SRTBlocks implements Iterator
{
    private $lines;
    private $current;
    private $key;
    public function __construct($lines)
    {
        if (is_array($lines))
        {
            $lines = new ArrayIterator($lines);
        }
        $this->lines = $lines;
    }
    public function rewind()
    {
        $this->lines->rewind();
        $this->current = NULL;
        $this->key = 0;
    }
    public function valid()
    {
        return $this->lines->valid();
    }
    public function current()
    {
        if (NULL !== $this->current)
        {
            return $this->current;
        }
        $state = 0;
        $block = NULL;
        while ($this->lines->valid() && $line = $this->lines->current())
        {
            switch($state)
            {
                case 0:
                    $block = array();
                    $block['number'] = $line;
                    $state = 1;
                    break;
                case 1:
                    $block['range'] = $line;
                    $state = 2;
                    break;
                case 2:
                    $block['text'] = '';
                    $state = 3;
                    # fall-through intended
                case 3:
                    if ($line === '') {
                        $state = 0;
                        break 2;
                    }
                    $block['text'] .= ($block['text'] ? "\n" : '') . $line;
                    break;
                default:
                    throw new Exception(sprintf('Unhandled %d.', $state));
            }
            $this->lines->next();
        }
        if (NULL === $block)
        {
            throw new Exception('Parser invalid (empty).');
        }
        $this->current = $block;
        $this->key++;
        return $block;
    }
    public function key()
    {
        return $this->key;
    }
    public function next()
    {
        $this->lines->next();
        $this->current = NULL;
    }
}

基本用法如下,输出可以在Demo

中看到
$blocks = new SRTBlocks($lines); 
foreach($blocks as $index => $block)
{
    printf("Block #%d:\n", $index);
    print_r($block);
}

所以现在可以迭代SRT文件中的所有块。现在唯一剩下的就是并行迭代两个SRT文件。从PHP 5.3开始,SPL附带MultipleIterator来执行此操作。它现在非常简单,例如我使用相同的两次:

$multi = new MultipleIterator();
$multi->attachIterator(new SRTBlocks($lines));
$multi->attachIterator(new SRTBlocks($lines));

foreach($multi as $blockPair)
{
    list($block1, $block2) = $blockPair;
    echo $block1['number'], "\n", $block1['range'], "\n", 
        $block1['text'], "\n", $block2['text'], "\n\n";
}

将字符串(而不是输出)存储到文件中相当简单,所以我将其排除在答案之外。

那么要说什么?首先,可以在循环和某些状态中轻松地解析文件中的行等顺序数据。这不仅适用于文件中的行,也适用于字符串。

其次,为什么我在这里建议一个迭代器?首先,它易于使用。从并行处理一个文件到两个文件只需要一小步。接下来,迭代器实际上也可以在另一个迭代器上运行。例如,使用SPLFileObject类。它为文件中的所有行提供了一个迭代器。如果你有大文件,你可以使用SPLFileObject(而不是一个数组),在添加SRTBlocks一小部分删除尾随EOL字符后,你不需要先将这两个文件加载到数组中从每一行的结尾:

$line = rtrim($line, "\n\r");

它只是起作用:

$multi = new MultipleIterator();
$multi->attachIterator(new SRTBlocks(new SplFileObject($file1)));
$multi->attachIterator(new SRTBlocks(new SplFileObject($file2)));

foreach($multi as $blockPair)
{
    list($block1, $block2) = $blockPair;
    echo $block1['number'], "\n", $block1['range'], "\n", 
        $block1['text'], "\n", $block2['text'], "\n\n";
}

完成后你甚至可以使用(几乎)相同的代码处理非常大的文件。灵活,不是吗? The full demonstration