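Given a file corrupted by mixed encodings (for example UTF-8 and Latin-1), how can I configure Emacs to "project" all of its characters onto a single encoding (for example UTF-8) when saving the file?
I wrote the following function to automate some of the cleaning, but I suspect there is a table somewhere that maps the symbol "é" in one encoding to "é" in UTF-8, which I could use to improve this function (or perhaps someone has already written such a function).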
(defun jyby/cleanToUTF ()
  "Delete known stray characters from the whole buffer."
  (interactive)
  (save-excursion
    ;; Search from the start of the buffer for each pattern, not from point.
    (dolist (pattern '("अ" "आ" "ॆ"))
      (goto-char (point-min))
      (while (re-search-forward pattern nil t)
        (replace-match "")))))
(global-unset-key [f11])
(global-set-key [f11] 'jyby/cleanToUTF)
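I have many files "corrupted" by mixed encodings (caused by copy-pasting from a browser with a misconfigured font), which produce the error below. I sometimes clean them manually, by searching for each problematic symbol and replacing it with "" or with the appropriate character. A quicker option is to specify "utf-8-unix" as the coding system, but then the same prompt comes up the next time the file is edited and saved. This has become a problem because, in any such corrupted file, every accented character is replaced by a sequence that doubles at each save, eventually doubling the size of the file. I am using GNU Emacs 24.2.1.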
These default coding systems were tried to encode text
in the buffer `test_accents.org':
(utf-8-unix (30 . 4194182) (33 . 4194182) (34 . 4194182) (37
. 4194182) (40 . 4194181) (41 . 4194182) (42 . 4194182) (45
. 4194182) (48 . 4194182) (49 . 4194182) (52 . 4194182))
However, each of them encountered characters it couldn't encode:
utf-8-unix cannot encode these: ...
Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.
Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).
raw-text emacs-mule no-conversion
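Answer 0 (score: 2)
I have run into this many times in Emacs. When I have a messed-up file, for example one opened in raw-text-unix mode, and save it as utf-8, Emacs complains even about text that is already clean UTF-8. I have not found a way to make it complain only about the parts that are not UTF-8.
I have just found a reasonable semi-automated approach using recode: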
f=mixed-file
# Read from stdin so that recode acts as a filter and leaves $f untouched.
recode -f ..utf-8 < "$f" > /tmp/recode.out
diff "$f" /tmp/recode.out | cat -vt
# Manually fix the lines of text in $f that cannot be converted to UTF-8,
# then re-run recode and diff until the diff output is empty.
A useful tool along the way is http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=342+200+224&mode=obytes
Then I simply reopen the file in Emacs, and it is recognized as clean Unicode.
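For readers who would rather stay inside Emacs, here is a minimal sketch of the same "find the lines that still need fixing" step. It assumes that Emacs keeps every byte it cannot decode as UTF-8 as a raw-byte character with a code between #x3FFF80 and #x3FFFFF, which is consistent with codes such as 4194182 in the error message from the question. The function name my/invalid-utf8-lines is hypothetical and not part of Emacs.

```elisp
(defun my/invalid-utf8-lines (file)
  "Return the line numbers in FILE that contain bytes which are not valid UTF-8."
  (with-temp-buffer
    ;; Force UTF-8 decoding; bytes that cannot be decoded are assumed to
    ;; survive as raw-byte characters (codes #x3FFF80 to #x3FFFFF).
    (let ((coding-system-for-read 'utf-8))
      (insert-file-contents file))
    (goto-char (point-min))
    (let ((lines '()))
      (while (not (eobp))
        (if (>= (char-after) #x3FFF80)
            (progn
              ;; Record this line once, then skip to the next line.
              (push (line-number-at-pos) lines)
              (forward-line 1))
          (forward-char 1)))
      (nreverse lines))))
```

Evaluating (my/invalid-utf8-lines "mixed-file") should return a list of line numbers to inspect; an empty list would suggest that the file decodes cleanly as UTF-8.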
Answer 1 (score: 1)
This may get you started:
(require 'eieio) ; defclass, defmethod, with-slots
(require 'cl)    ; loop, incf, decf, setf
(put 'eof-error 'error-conditions '(error eof-error))
(put 'eof-error 'error-message "End of stream")
(put 'bad-byte 'error-conditions '(error bad-byte))
(put 'bad-byte 'error-message "Not a UTF-8 byte")
(defclass stream ()
  ((bytes :initarg :bytes :accessor bytes-of)
   (position :initform 0 :accessor position-of)))
;; Non-nil if bit number BIT of BYTE is set.
(defun logbitp (byte bit) (not (zerop (logand byte (ash 1 bit)))))
(defmethod read-byte ((this stream) &optional eof-error eof)
  (with-slots (bytes position) this
    (if (< position (length bytes))
        (prog1 (aref bytes position) (incf position))
      (if eof-error (signal eof-error (list position)) eof))))
(defmethod unread-byte ((this stream))
  (when (> (position-of this) 0) (decf (position-of this))))
(defun read-utf8-char (stream)
  (let ((byte (read-byte stream 'eof-error)))
    (if (not (logbitp byte 7)) byte        ; 0xxxxxxx: plain ASCII
      (let ((numbytes                      ; continuation bytes expected
             (cond
              ((not (logbitp byte 6)) nil) ; 10xxxxxx cannot start a character
              ((not (logbitp byte 5))
               (setf byte (logand #2r11111 byte)) 1)
              ((not (logbitp byte 4))
               (setf byte (logand #2r1111 byte)) 2)
              ((not (logbitp byte 3))
               (setf byte (logand #2r111 byte)) 3))))
        (if (null numbytes)
            (signal 'bad-byte (list byte))
          (dotimes (b numbytes byte)
            (let ((next-byte (read-byte stream 'eof-error)))
              (if (and (logbitp next-byte 7) (not (logbitp next-byte 6)))
                  (setf byte (logior (ash byte 6) (logand next-byte #2r111111)))
                (signal 'bad-byte (list next-byte))))))))))
(defun load-corrupt-file (file)
  (interactive "fFile to load: ")
  (with-temp-buffer
    (set-buffer-multibyte nil)             ; keep the raw bytes, one per position
    (insert-file-contents-literally file)
    (with-output-to-string
      (loop with stream = (make-instance 'stream :bytes (buffer-string))
            for next-char =
            (condition-case err
                (read-utf8-char stream)
              ;; Placeholder: report the byte and emit U+FFFD in its place.
              (bad-byte (message "Fix this byte %d" (cadr err)) ?\uFFFD)
              (eof-error nil))
            while next-char
            do (write-char next-char)))))
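What this code does: it loads a file without any conversion and tries to read it as if it were encoded in UTF-8. As soon as it meets a byte that does not look like part of a UTF-8 sequence, it signals an error, which you need to handle somehow (that is where the "Fix this byte" message is). But you will have to work out for yourself how to fix it...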
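One possible way to fill in the "Fix this byte" step is to treat every byte that is not valid UTF-8 as Latin-1, as in the minimal sketch below. This is an assumption about where the stray bytes came from, not something the code above establishes. It works on a buffer that Emacs has already decoded as UTF-8, and assumes the undecodable bytes are still present as raw-byte characters (codes #x3FFF80 to #x3FFFFF). The function name my/raw-bytes-to-latin-1 is hypothetical.

```elisp
(defun my/raw-bytes-to-latin-1 ()
  "Replace each raw byte in the current buffer with a Latin-1 character.
Return the number of replacements made."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (let ((ch (char-after)))
          (if (>= ch #x3FFF80)
              (progn
                ;; Raw byte #xNN is stored as character #x3FFFNN, and
                ;; Latin-1 byte #xNN is the Unicode character U+00NN.
                (delete-char 1)
                (insert (- ch #x3FFF00))
                (setq count (1+ count)))
            (forward-char 1)))))
    count))
```

Bytes in the range 0x80 to 0x9F become C1 control characters under this mapping, so if the stray text actually came from Windows-1252 a different table would be needed for those bytes. Valid UTF-8 text in the buffer is left untouched.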