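I wrote some code to convert an input UCS-2LE file to plain 8-bit ISO-8859-1 text. After the conversion, I use strtok to split the text into words. I then call strlen on each word, but the lengths I get are strange and I can't make sense of them.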
<?php
$fileData = file('input.txt');
foreach ($fileData as $txt) {
    $txt = iconv('ISO-8859-1', 'UCS-2LE', $txt);
    $tok = strtok($txt, " \n\t");
    while ($tok !== false) {
        echo 'Word = '.$tok.', Length = '.strlen($tok).'<br />';
        $tok = strtok(" \n\t");
    }
}
?>
The input file, input.txt (encoded in UCS-2LE), contains:
Slot# NumJobs ActiveJobID ActiveBatchJob ActiveProcStartTime
0 0 1 input3.dat 7:20 PM
1 0 2 input3.dat 7:20 PM
Output:
Word = ÿþSlot#, Length = 24
Word = NumJobs, Length = 31
Word = ActiveJobID, Length = 47
Word = ActiveBatchJob, Length = 59
Word = ActiveProcStartTime , Length = 83
Word = , Length = 1
Word = 0, Length = 6
Word = 0, Length = 7
Word = 1, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = 1, Length = 6
Word = 0, Length = 7
Word = 2, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = , Length = 2
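1) How can I get the correct lengths?
2) Line 6 of the output is a newline character that strtok did not tokenize correctly. Why is that?
3) I have read a bit about the BOM, and I understand that the first two characters of the file identify the encoding in use. Is there a way to skip these characters? In the first line of the output they show up as two characters.
Thanks in advance for your help.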
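
For reference, here is one possible reading of the output and a minimal sketch of a fix. iconv takes the source encoding first and the target second, so the call above treats the UCS-2LE bytes as ISO-8859-1 and widens every byte to two. strlen counts bytes, so "ÿþSlot#" (2 BOM bytes plus 5 two-byte characters = 12 bytes) comes out as 24. file() also splits on the single byte 0x0A, which cuts the two-byte newline in half and leaves a stray 0x00 at the start of the next line; that would explain the length-1 tokens. The sketch below assumes the file really is UCS-2LE with an optional BOM. The helper name `ucs2leWords` is made up for illustration, and `preg_split` is used here in place of the strtok loop.

```php
<?php
// Hypothetical helper (not part of the original post): decode UCS-2LE
// bytes to ISO-8859-1 and split the result into whitespace-separated words.
function ucs2leWords(string $bytes): array
{
    // Skip the 2-byte little-endian BOM (0xFF 0xFE) if the data starts with it.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = substr($bytes, 2);
    }

    // Source encoding first, target encoding second.
    $text = iconv('UCS-2LE', 'ISO-8859-1', $bytes);
    if ($text === false) {
        return []; // conversion failed, e.g. a character with no ISO-8859-1 equivalent
    }

    // Split on runs of whitespace (space, tab, CR, LF) and drop empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? [] : $words;
}

// Read the whole file as one byte string instead of using file(),
// so two-byte characters are never split across lines.
if (is_readable('input.txt')) {
    foreach (ucs2leWords(file_get_contents('input.txt')) as $word) {
        echo 'Word = ' . $word . ', Length = ' . strlen($word) . "<br />\n";
    }
}
```

After conversion to ISO-8859-1 each character is one byte, so strlen should match the visible length of each word. If the target were UTF-8 instead, mb_strlen would be the safer choice for counting characters.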