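Combining the answers from here and here, I created a function that checks whether the character I'm looking at is an EOL. I need it for strings with mixed line endings and possibly mixed encodings. It could even be used to sanitize a string by replacing every line ending with \n.

// check if (possibly multibyte) character is EOL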
protected function _is_eol($char) {
    static $eols = array(
        "\0x000D000A", // [UNICODE] CR+LF: CR (U+000D) followed by LF (U+000A)
        "\0x000A",     // [UNICODE] LF: Line Feed, U+000A
        "\0x000B",     // [UNICODE] VT: Vertical Tab, U+000B
        "\0x000C",     // [UNICODE] FF: Form Feed, U+000C
        "\0x000D",     // [UNICODE] CR: Carriage Return, U+000D
        "\0x0085",     // [UNICODE] NEL: Next Line, U+0085
        "\0x2028",     // [UNICODE] LS: Line Separator, U+2028
        "\0x2029",     // [UNICODE] PS: Paragraph Separator, U+2029
        "\0x0D0A",     // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\0x0A0D",     // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
        "\0x0A",       // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\0x0D",       // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
        "\0x1E",       // [ASCII] RS: QNX (pre-POSIX)
        "\0x15"        // [EBCDIC] NEL: OS/390, OS/400
    );
    $is_eol = false;
    foreach ($eols as $eol) {
        if ($char === $eol) {
            $is_eol = true;
            break;
        }
    }
    return $is_eol;
}
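
For reference, a minimal sketch of how PHP reads literals like the ones in the list above. In a double-quoted string `\0` is an octal escape for a NUL byte, so "\0x000A" is NUL followed by the literal text x000A rather than a line feed. The variable names are only illustrative, and the `\u{...}` form assumes PHP 7 or later (it emits the UTF-8 encoding of the code point):

```php
<?php
// "\0" is parsed as an octal escape (NUL); the rest is literal text.
$as_written = "\0x000A";
var_dump(strlen($as_written));   // 6 bytes: NUL, 'x', '0', '0', '0', 'A'
var_dump($as_written === "\n");  // false

// "\xHH" yields exactly one byte per escape.
$lf   = "\x0A";                  // same byte as "\n"
$crlf = "\x0D\x0A";              // same bytes as "\r\n"

// "\u{...}" (PHP 7+) yields the UTF-8 bytes of the code point.
$ls = "\u{2028}";                // LINE SEPARATOR
echo bin2hex($ls), "\n";         // e280a8
```

So a comparison such as `$char === "\0x000A"` can only succeed for a six-byte string, which a single character never is.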
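I may need to peek at the next character when the current one is CR or LF, so that I don't mistake CRLF or LFCR for two line endings, but apart from that this looks fine to me. The problem is that I don't know the encoding and have no data to test it with.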
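
The look-ahead for CR and LF could be avoided by matching the two-character sequences before the single characters. A minimal sketch, assuming the input has already been converted to UTF-8; the function name `normalize_eol` is hypothetical, and RS and the EBCDIC NEL are left out on purpose because they only act as line endings on their own platforms:

```php
<?php
// Hypothetical helper: replace every recognised line ending with "\n".
// Assumes $text is valid UTF-8; preg_replace() returns null if it is not.
function normalize_eol(string $text): ?string
{
    // Alternatives are tried left to right, so CR+LF and LF+CR are consumed
    // as one line ending before the single-character class is tried.
    $pattern = '/\r\n|\n\r|[\n\x{0B}\x{0C}\r\x{85}\x{2028}\x{2029}]/u';
    return preg_replace($pattern, "\n", $text);
}

$mixed = "a\r\nb\n\rc\rd\u{2028}e\u{0085}f";
echo bin2hex(normalize_eol($mixed)), "\n";
```

Once the string is normalized this way, an EOL check reduces to comparing against "\n".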
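Is there a fatal flaw in my approach?
Am I missing line separators from other popular encodings?
The code says [UNICODE], but isn't there a difference between UTF-8, UTF-16 and UTF-32?
I found this snippet on GitHub: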
if ($this->file_encoding === 'UTF-16LE') {
    $this->line_separator = "\x0A\x00";
}
elseif ($this->file_encoding === 'UTF-16BE') {
    $this->line_separator = "\x00\x0A";
}
elseif ($this->file_encoding === 'UTF-32LE') {
    $this->line_separator = "\x0A\x00\x00\x00";
}
elseif ($this->file_encoding === 'UTF-32BE') {
    $this->line_separator = "\x00\x00\x00\x0A";
}
This makes me think I may be missing some. If I'm not mistaken, the last one, "\x00\x00\x00\x0A", would be "0x0000000A"?
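
To see how the same code point turns into different bytes per encoding, here is a small sketch that prints U+000A (LF) in each of the encodings from the snippet. It assumes the mbstring extension is loaded; the variable names are only illustrative:

```php
<?php
// Print the bytes of U+000A (LF) in several Unicode encodings.
// Assumes the mbstring extension is available.
$encodings = ['UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE'];
$lf_bytes = [];
foreach ($encodings as $encoding) {
    // Convert a UTF-8 "\n" to the target encoding and show it as hex.
    $lf_bytes[$encoding] = bin2hex(mb_convert_encoding("\n", $encoding, 'UTF-8'));
    printf("%-8s %s\n", $encoding, $lf_bytes[$encoding]);
}
```

The UTF-32BE line prints 0000000a, which is the four bytes "\x00\x00\x00\x0A". The PHP string holds those bytes, not the integer 0x0000000A, so the escape has to be written byte by byte.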