Question

I need to read the contents of a .csv file and replace some strings in it. For that I use the following simple code:

    $pattern = "test";
    $replacement = "Replacement";
    $string = file_get_contents( "myDoc.csv" );
    $string =  str_replace( $pattern, $replacement, $string );

However, $string contains:

    echo $string; // Outputs: This is my test 
    var_dump($string); // Outputs: string(32) "��This is my test"

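I found out that the file is encoded as UCS-2 LE BOM. If I convert the file to another encoding, I might lose some symbols/characters.

The file must keep the same format, and its content must not be modified in any way (apart from the target strings).

How can I replace the strings without losing any information?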

Answer 1

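Stripping the BOM

The byte order mark (BOM) is a sequence of bytes at the very beginning of a file. For UTF-8, for example, it is the three-byte sequence 0xEF, 0xBB, 0xBF. For UTF-16 little-endian (LE), the BOM is the two bytes 0xFF 0xFE. So you can remove it with a simple regular expression, e.g.: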

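    // Remove a leading UTF-8 BOM (EF BB BF), if present.
    function stripUtf8Bom($string) {
        return preg_replace('/^\xef\xbb\xbf/', '', $string);
    }

    // Remove a leading UTF-16 LE BOM (FF FE), if present.
    function stripUtf16Le($string) {
        return preg_replace('/^\xff\xfe/', '', $string);
    }

    // Remove a leading UTF-16 BE BOM (FE FF), if present.
    function stripUtf16Be($string) {
        return preg_replace('/^\xfe\xff/', '', $string);
    }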
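
If you do not know in advance which BOM a file carries, you can check the leading bytes before stripping anything. The following is only a minimal sketch: `detectBomEncoding` is a name made up for this illustration, and it assumes the input is one of the three encodings mentioned above (it does not tell UTF-32 LE apart from UTF-16 LE, since both start with FF FE).

```php
<?php
// Hypothetical helper (not a built-in PHP function): returns the encoding
// implied by a leading BOM, or null when no known BOM is found.
// Assumption: only UTF-8, UTF-16 LE and UTF-16 BE are considered.
function detectBomEncoding(string $bytes): ?string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    return null;
}
```

For the file in the question, `detectBomEncoding(file_get_contents("myDoc.csv"))` would be expected to return `'UTF-16LE'`.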

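Substring replacement

Standard string functions such as str_replace are not aware of multibyte characters; they operate on raw bytes. Use the mbstring functions instead:

mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience...

You may find the mb_ereg_replace function particularly useful.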
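
One possible way to put this together is to set the BOM aside, convert the rest to UTF-8, do the replacement there, convert back, and put the BOM back in front. This is a sketch under assumptions rather than a definitive solution: it assumes the mbstring extension is loaded, that the file is really UTF-16 LE (which is what editors usually mean by "UCS-2 LE BOM"), and that the pattern and replacement are UTF-8 strings. `replaceInUtf16LeString` is a hypothetical name chosen for this example. The round trip between UTF-16 and UTF-8 is lossless for valid input, so no characters should be lost.

```php
<?php
// Hypothetical helper: replaces $pattern with $replacement in a UTF-16 LE
// byte string, preserving a leading BOM if one is present.
// Assumptions: mbstring is available; $pattern and $replacement are UTF-8.
function replaceInUtf16LeString(string $bytes, string $pattern, string $replacement): string
{
    $bom = "\xFF\xFE";

    // Remember whether the input started with the UTF-16 LE BOM.
    $hasBom = strncmp($bytes, $bom, 2) === 0;
    if ($hasBom) {
        $bytes = substr($bytes, 2);
    }

    // Convert to UTF-8 so that plain str_replace sees the same bytes
    // as the UTF-8 pattern.
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
    $utf8 = str_replace($pattern, $replacement, $utf8);

    // Convert back to the original encoding and restore the BOM.
    $result = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

    return $hasBom ? $bom . $result : $result;
}
```

With the file from the question, usage could look like `file_put_contents("myDoc.csv", replaceInUtf16LeString(file_get_contents("myDoc.csv"), "test", "Replacement"));`. Going through UTF-8 also avoids a pitfall of replacing raw UTF-16 bytes directly, where a byte-level match can start in the middle of a two-byte code unit.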
