Question

Is it possible to read a single UTF-8 character from a file?

file:read(1) returns strange characters when I print the result.

function firstLetter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end

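This function returns the first UTF-8 character of the string str. I need to read one UTF-8 character in the same way, but from an input file, without reading the file into memory via file:read("*all").

This question is very similar to this post: Extract the first letter of a UTF-8 string with Lua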

Answer 1

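function read_utf8_char(file)
  local c1 = file:read(1)
  if not c1 then return nil end  -- end of file
  -- Count the leading 1 bits of the first byte; one less than that count
  -- is the number of continuation bytes. ASCII bytes are clamped to 128
  -- so that they yield 0.
  local ctr, c = -1, math.max(c1:byte(), 128)
  repeat
    ctr = ctr + 1
    c = (c - 128)*2
  until c < 128
  -- read(0) returns nil at end of file, so only read when bytes are expected
  if ctr == 0 then return c1 end
  return c1..(file:read(ctr) or "")
end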

Answer 2

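You need to read bytes so that the string being matched always contains four or more bytes, which lets you apply the logic from the answer you referenced. If the remaining length is len after you match and remove one UTF-8 character, read another 4-len bytes from the file.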
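A minimal sketch of that buffering idea follows. The name `utf8_chars` is hypothetical, and the pattern is a simplified variant of the one in the question: it does not rely on `%z` and does not validate lead bytes, so it assumes the file is well-formed UTF-8.

```lua
-- Simplified variant of the question's pattern: one non-continuation byte
-- followed by any continuation bytes. It does not validate lead bytes.
local CHAR_PATTERN = "^[^\128-\191][\128-\191]*"

-- Hypothetical helper: returns an iterator over the UTF-8 characters of an
-- open file handle, keeping at most 4 bytes buffered at any time.
function utf8_chars(file)
  local buf = ""
  return function()
    -- Top up the buffer to 4 bytes, the longest UTF-8 sequence.
    if #buf < 4 then
      buf = buf .. (file:read(4 - #buf) or "")
    end
    if buf == "" then return nil end  -- end of file
    -- Take the first character; fall back to a single byte if the
    -- buffer starts with a stray continuation byte.
    local ch = buf:match(CHAR_PATTERN) or buf:sub(1, 1)
    buf = buf:sub(#ch + 1)
    return ch
  end
end
```

It can be used as `for ch in utf8_chars(file) do ... end`, which never holds more than four bytes of the file in memory.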

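ZeroBrane Studio replaces invalid UTF-8 characters with a [SYN] character when printing to the Output panel (as shown in the screenshot). This blog post describes the logic behind detecting invalid UTF-8 characters in Lua and how ZeroBrane Studio handles them.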

Answer 3

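In the UTF-8 encoding, the number of bytes a character occupies is determined by its first byte, according to the following table (taken from RFC 3629):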

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

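If the most significant bit of the first byte is "0", the character consists of a single byte. If the leading bits are "110", the character has 2 bytes, and so on.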
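As an illustration of the table, the mapping from lead byte to sequence length could be written with plain numeric comparisons. This is only a sketch; `utf8_sequence_length` is a hypothetical name, and it classifies the bit patterns from the table without checking for overlong or out-of-range encodings.

```lua
-- Hypothetical helper: returns the total number of bytes in the UTF-8
-- sequence that starts with the given lead byte (a number 0..255), or nil
-- if the byte cannot start a sequence.
function utf8_sequence_length(lead)
  if lead < 0x80 then return 1        -- 0xxxxxxx
  elseif lead < 0xC0 then return nil  -- 10xxxxxx: continuation byte
  elseif lead < 0xE0 then return 2    -- 110xxxxx
  elseif lead < 0xF0 then return 3    -- 1110xxxx
  elseif lead < 0xF8 then return 4    -- 11110xxx
  end
  return nil                          -- 11111xxx: not covered by the table
end
```

For example, `utf8_sequence_length(("\226\130\172"):byte(1))` gives 3 for the three-byte encoding of the euro sign.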

You can then read one byte from the file and determine how many continuation bytes must be read to complete the UTF-8 character:

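-- Reads one UTF-8 character from an open file handle. Returns the character,
-- nil at end of file, or nil plus an error message on malformed input.
function get_one_utf8_character(file)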

  local c1 = file:read(1)
  if not c1 then return nil end

  local ncont
  if     c1:match("[\000-\127]") then ncont = 0
  elseif c1:match("[\192-\223]") then ncont = 1
  elseif c1:match("[\224-\239]") then ncont = 2
  elseif c1:match("[\240-\247]") then ncont = 3
  else
    return nil, "invalid leading byte"
  end

  local bytes = { c1 }
  for i=1,ncont do
    local ci = file:read(1)
    if not (ci and ci:match("[\128-\191]")) then
      return nil, "expected continuation byte"
    end
    bytes[#bytes+1] = ci
  end

  return table.concat(bytes)
end
