PHP - Check whether a character is an EOL

Time: 2014-06-29 01:15:58

Tags: php unicode encoding utf-8 eol

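Combining the answers from here and here, I created a function that checks whether the character I am looking at is an EOL. I need it for strings with mixed line endings and possibly mixed encodings. It could even be used to sanitize such a string by replacing every line ending with \n.
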
// Check whether a (possibly multibyte) character is an EOL.
// PHP has no "\0x...." escape: "\0x000A" is a NUL byte followed by the literal
// text "x000A". The entries below use "\x.." byte escapes instead, and the
// [UNICODE] entries assume the input is UTF-8.
protected function _is_eol($char) {
    static $eols = array(
            "\x0D\x0A",     // [UNICODE] CR+LF: CR (U+000D) followed by LF (U+000A)
            "\x0A",         // [UNICODE] LF: Line Feed, U+000A
            "\x0B",         // [UNICODE] VT: Vertical Tab, U+000B
            "\x0C",         // [UNICODE] FF: Form Feed, U+000C
            "\x0D",         // [UNICODE] CR: Carriage Return, U+000D
            "\xC2\x85",     // [UNICODE] NEL: Next Line, U+0085
            "\xE2\x80\xA8", // [UNICODE] LS: Line Separator, U+2028
            "\xE2\x80\xA9", // [UNICODE] PS: Paragraph Separator, U+2029
            "\x0D\x0A",     // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
            "\x0A\x0D",     // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output
            "\x0A",         // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
            "\x0D",         // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
            "\x1E",         // [ASCII] RS: QNX (pre-POSIX)
            "\x15"          // [EBCDIC] NEL: OS/390, OS/400
    );
    // Strict comparison: $char must match one of the byte sequences exactly.
    return in_array($char, $eols, true);
}

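I may need to peek at the next character when the current one is CR or LF, so that I do not mistake CRLF or LFCR for two line endings, but other than that this works fine for me. The problem is that I do not know the encodings, and I have no data to test it with.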
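
The "peek at the next character" idea and the sanitizing step can be sketched with a single regular expression. This is only a minimal sketch under stated assumptions: the input is assumed to be valid UTF-8, `normalize_eols` is a hypothetical name that does not appear in the original code, and LF+CR is not treated as a pair (it would count as two line endings).

```php
<?php
// Hypothetical helper (not part of the original post): replaces every
// line ending with "\n". Assumes $text is valid UTF-8.
function normalize_eols($text)
{
    // CR+LF is tried first, so the pair is consumed as ONE line ending;
    // this plays the role of peeking at the next character.
    // The character class covers LF, VT, FF, CR, NEL (U+0085),
    // LS (U+2028) and PS (U+2029). The /u modifier makes PCRE treat
    // both pattern and subject as UTF-8.
    $pattern = '/\r\n|[\n\x0B\f\r\x{85}\x{2028}\x{2029}]/u';
    $result = preg_replace($pattern, "\n", $text);

    // preg_replace() returns null on failure, e.g. for invalid UTF-8.
    return $result === null ? false : $result;
}

// CR+LF and LS (U+2028) both become a single LF.
echo bin2hex(normalize_eols("a\r\nb\xE2\x80\xA8c")), "\n"; // 610a620a63
```

Listing the code points explicitly, rather than relying on a shorthand, keeps the pattern in step with the list in `_is_eol()`. It does nothing for input in UTF-16 or UTF-32, which would have to be converted to UTF-8 first.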

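Is there a fatal flaw in my approach?
Am I missing line separators from other popular encodings? The code says [UNICODE], but isn't there a difference between UTF-8, UTF-16 and UTF-32? I found this snippet on GitHub: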

// Compare with ===; a single = would assign and always take the first branch.
if ($this->file_encoding === 'UTF-16LE') {
    $this->line_separator = "\x0A\x00";
}
elseif ($this->file_encoding === 'UTF-16BE') {
    $this->line_separator = "\x00\x0A";
}
elseif ($this->file_encoding === 'UTF-32LE') {
    $this->line_separator = "\x0A\x00\x00\x00";
}
elseif ($this->file_encoding === 'UTF-32BE') {
    $this->line_separator = "\x00\x00\x00\x0A";
}

This makes me think I may be missing some. If I am not mistaken, the last one, "\x00\x00\x00\x0A", would be "0x0000000A".
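
To see how the same line-ending code points differ between UTF-8, UTF-16 and UTF-32, the byte sequences can be generated rather than written by hand. The following is a rough sketch that assumes the iconv extension is available; `eol_bytes_in` is a hypothetical helper name, and the list of code points is only a sample, not a complete inventory.

```php
<?php
// Hypothetical helper (not from the original post): returns the hex bytes
// of a few EOL code points in the given target encoding.
// Assumes the iconv extension is loaded.
function eol_bytes_in($encoding)
{
    // EOL code points written as UTF-8 byte sequences.
    $eols_utf8 = array(
        'LF'  => "\x0A",         // U+000A
        'CR'  => "\x0D",         // U+000D
        'NEL' => "\xC2\x85",     // U+0085
        'LS'  => "\xE2\x80\xA8", // U+2028
        'PS'  => "\xE2\x80\xA9", // U+2029
    );

    $out = array();
    foreach ($eols_utf8 as $name => $bytes) {
        // Explicit LE/BE encoding names are used so that iconv does not
        // prepend a byte order mark.
        $out[$name] = bin2hex(iconv('UTF-8', $encoding, $bytes));
    }
    return $out;
}

foreach (array('UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE') as $enc) {
    echo $enc, ":\n";
    foreach (eol_bytes_in($enc) as $name => $hex) {
        echo '  ', $name, ' = ', $hex, "\n";
    }
}
```

The code point is the same in every case; only its serialization changes. A byte-wise comparison such as the one in `_is_eol()` therefore only works once the encoding of the input is known, or after the input has been converted to one known encoding.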

0 answers:

No answers