Question

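I have a set of lines in a file, where each line may represent a multi-line comment. The original developer chose the pilcrow (¶) as the line separator, because he figured it would never show up in anyone's comment. I'm now loading the lines into a database and want to use a more typical line separator (one that can be set by the application installer).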

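The problem is that some lines are encoded in ISO-8859-1 (pilcrow is hex b6) while others are encoded in UTF-8 (pilcrow is hex c2b6). I'm looking for an elegant way to handle this, something better than what I'm currently doing.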
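To make the two byte forms concrete, here is a small illustrative sketch (the variable names are hypothetical, not from the original data loader). It also shows why a blind replacement of the single byte b6 is risky: the UTF-8 encoding of ö ends in that same byte.

```php
<?php
// Illustrative only: hypothetical variable names.
$latin1Pilcrow = "\xB6";      // ISO-8859-1: one byte
$utf8Pilcrow   = "\xC2\xB6";  // UTF-8: two bytes for U+00B6
$utf8OUmlaut   = "\xC3\xB6";  // UTF-8 "ö" (U+00F6) also ends in the byte 0xB6

echo bin2hex($latin1Pilcrow), "\n"; // b6
echo bin2hex($utf8Pilcrow), "\n";   // c2b6
echo bin2hex($utf8OUmlaut), "\n";   // c3b6

// A lone 0xB6 byte is not valid UTF-8, so the two forms can be told apart.
var_dump(mb_check_encoding($latin1Pilcrow, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8Pilcrow, 'UTF-8'));   // bool(true)
```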

Here is how I handle it so far, but I'd prefer a more elegant solution:

// Due to the way the quote file is stored, line breaks can either be
// in 2-byte or 1-byte characters for the pilcrow. Since we're dealing
// with them on a unix system, it makes more sense to replace these
// funky characters with a newline character as is more standard.
//
// To do this, however, requires a bit of chicanery. We have to do
// 1-byte replacement, but with a 2-byte character.
//
// First, some constants:
define('PILCROW', '¶'); // standard two-byte pilcrow character
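define('SHORT_PILCROW', chr(0xB6)); // the one-byte version used in some places in the source data
// Without the /u modifier a character class matches single bytes, so [¶\xB6] would replace
// 0xC2 and 0xB6 separately. Use alternation instead, with the two-byte form first.
define('NEEDLE', '/'.PILCROW.'|'.SHORT_PILCROW.'/'); // this is what is searched for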
define('REPLACEMENT', $GLOBALS['linesep']);

function fix_line_breaks($quote)
{
  $t0 = preg_replace(NEEDLE,REPLACEMENT,$quote); // convert either long or short pilcrow to a newline. 
  return $t0;
}

Answer 1

I would do it like this:

define('PILCROW', '¶'); // standard two-byte pilcrow character
define('REPLACEMENT', $GLOBALS['linesep']);

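function fix_encoding($quote) {
    // Limit detection to the two encodings present in the data; with the default
    // detection order an ISO-8859-1 line may not be recognized at all.
    $encoding = mb_detect_encoding($quote, ['UTF-8', 'ISO-8859-1'], true);
    return mb_convert_encoding($quote, 'UTF-8', $encoding);
}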

function fix_line_breaks($quote) {
    // convert UTF-8 pilcrow to a newline.
    return str_replace(PILCROW, REPLACEMENT, $quote);
}

For each comment line, call fix_encoding, then call fix_line_breaks:

$quote = fix_encoding($quote);
$quote = fix_line_breaks($quote);
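
If encoding detection feels too heuristic, a variation on the same idea is to validate rather than detect. The sketch below assumes every line is either valid UTF-8 or ISO-8859-1; `normalize_quote`, `UTF8_PILCROW`, and the `"\n"` default are hypothetical names and values standing in for the configurable separator in the question.

```php
<?php
// Hypothetical helper, not part of the original code base.
const UTF8_PILCROW = "\xC2\xB6"; // U+00B6 encoded as UTF-8

function normalize_quote(string $quote, string $lineSep = "\n"): string
{
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    if (!mb_check_encoding($quote, 'UTF-8')) {
        $quote = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');
    }
    // Every pilcrow is now in its two-byte UTF-8 form, so a plain
    // string replacement is enough and other characters are untouched.
    return str_replace(UTF8_PILCROW, $lineSep, $quote);
}

$fromLatin1 = normalize_quote("first\xB6second");                 // ISO-8859-1 input
$fromUtf8   = normalize_quote("caf\xC3\xA9\xC2\xB6bl\xC3\xB6d");  // UTF-8 input with é and ö
```

One caveat: an ISO-8859-1 line whose bytes happen to form valid UTF-8 (for example the two characters "Â¶") would be treated as UTF-8, so this is a best-effort approach rather than a guarantee.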
