Question

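I'd like to know whether there is a simple way in Tcl to read a double-byte file (or at least I think that's what it's called). My problem is that the files look fine when I open them in Notepad (I'm on Win7), but when I read them in Tcl there are spaces (or rather, null characters) between every character.

My current workaround is to first run string map to remove all the nulls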

string map {\0 {}} $file

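and then process the information normally, but is there a simpler way to do this, through fconfigure, encoding, or some other means?

I'm not familiar with encodings, so I'm not sure what arguments I should use.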

fconfigure $input -encoding double

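This fails, of course, because double is not a valid encoding. The same goes for 'doublebyte'.

I'm actually working with big text files (over 2 GB) and applying my 'workaround' line by line, so I assume this slows the process down.

EDIT: As @mhawke pointed out, the file is UTF-16-LE encoded, which apparently is not a supported encoding. Is there an elegant way to get around this shortcoming, perhaps through a proc? Or would that be more complicated than using string map?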

Answer 1

The input files are probably UTF-16 encoded, as is common on Windows.

Try:

% fconfigure $input -encoding unicode

You can get the list of available encodings with:

% encoding names
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine gb2312 jis0201 euc-cn euc-jp iso8859-10 macThai iso2022-jp jis0208 macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania gb1988 iso2022-kr macTurkish macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 koi8-r iso8859-4 macCroatian ebcdic cp1250 iso8859-5 iso8859-6 macCyrillic cp1251 iso8859-7 cp1252 koi8-u macDingbats iso8859-8 cp1253 cp1254 iso8859-9 cp1255 cp850 cp932 cp1256 cp852 cp1257 identity cp1258 macJapan utf-8 shiftjis cp936 cp855 symbol cp775 unicode cp857
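
As a minimal sketch of how that suggestion could be used on a large file, the proc below reads the input line by line with -encoding unicode and writes it back out as UTF-8. The proc name utf16_to_utf8 and the file names are hypothetical, and the sketch assumes a Tcl 8.x build where unicode means UTF-16 in the machine's native byte order (little-endian on Windows); treat it as an illustration rather than a tested solution for every file.

```tcl
# Hypothetical helper: copy a UTF-16 text file to UTF-8, one line at a time,
# so that a multi-gigabyte file never has to fit into a single variable.
proc utf16_to_utf8 {infile outfile} {
    set in [open $infile r]
    # "unicode" is Tcl's name for UTF-16 in native byte order (assumption: Tcl 8.x).
    # Disable the end-of-file character so a stray 0x1A byte cannot cut the read short.
    fconfigure $in -encoding unicode -eofchar {}

    set out [open $outfile w]
    fconfigure $out -encoding utf-8

    set first 1
    while {[gets $in line] >= 0} {
        if {$first} {
            # Drop a leading byte order mark (U+FEFF) if the file has one
            set line [string trimleft $line \uFEFF]
            set first 0
        }
        puts $out $line
    }
    close $out
    close $in
}
```

It would be called as utf16_to_utf8 infile.txt outfile.txt. Because the channel does the decoding, no string map pass is needed on each line.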

Answer 2

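I decided to write a small proc to convert the file. I'm using a while loop because reading a 3 GB file into a single variable locked up the process completely... The comments make it look long, but it really isn't.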

proc itrans {infile outfile} {
  set f [open $infile r]

  # Note: files I have been getting have CRLF, so I split on CR to keep the LF and
  # used -nonewline in puts
  fconfigure $f -translation cr -eof ""

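  # Simple switch just to remove the BOM, since the result will be UTF-8
  set bom 0
  # Default to keeping the first byte of each pair, so $re is defined even without a BOM
  set re "(..).."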
  set o [open $outfile w]
  while {[gets $f l] != -1} {
    # Convert to binary where the specific characters can be easily identified
    binary scan $l H* l

    # Ignore empty lines
    if {$l == "" || $l == "00"} {continue}

    # If it is the first line, there's the BOM
    if {!$bom} {
      set bom 1

      # Identify and remove the BOM and set what byte should be removed and kept
      if {[regexp -nocase -- {^(?:FFFE|FEFF)} $l m]} {
        regsub -- "^$m" $l "" l

        if {[string toupper $m] eq "FFFE"} {
          set re "(..).."
        } elseif {[string toupper $m] eq "FEFF"} {
          set re "..(..)"
        }
      }
      regsub -all -- $re $l {\1} new
    } else {
      # Regardless of utf-16-le or utf-16-be, that should work since we split on CR
      regsub -all -- {..(..)|00$} $l {\1} new
    }
    puts -nonewline $o [binary format H* $new]
  }
  close $o
  close $f
}

itrans infile.txt outfile.txt

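Final caveat: this will mangle characters that actually use all 16 bits (for example, the code unit sequence 04 30 will lose the 04 and become 30 instead of becoming D0 B0, whereas 00 4D is correctly mapped to 4D as a single character), so before trying the above, make sure that you don't mind or that your file contains no such characters.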
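
To make the caveat concrete, here is a small sketch that compares byte-dropping with real decoding on the two example characters. It assumes a little-endian host and a Tcl 8.x build where the unicode encoding is native-order UTF-16; the variable names are only illustrative.

```tcl
# Hex dump of two UTF-16-LE code units: U+0430 (Cyrillic "a") and U+004D ("M")
set hex 30044d00

# Byte-dropping, as the proc above does: keep one byte of each pair
regsub -all -- {(..)..} $hex {\1} dropped

# Real decoding: turn the bytes into characters, then re-encode as UTF-8
set text [encoding convertfrom unicode [binary format H* $hex]]
binary scan [encoding convertto utf-8 $text] H* utf8Hex

puts "byte-dropping: $dropped"
puts "decoded to UTF-8: $utf8Hex"
```

Byte-dropping yields 304d, so the Cyrillic letter turns into the digit 0, while decoding should yield d0b04d under the stated assumptions.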

Reading double-byte files
