Extra characters read when opening a file

Time: 2018-01-12 16:50:03

Tags: lisp common-lisp

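I am trying to read a file into a string (not a list) in Common Lisp, but I end up with extra characters at the end of the string. This only happens when the file contains characters such as newlines or tabs; spaces seem to work fine. Here is my code: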

(defun load-file (filename)
  (with-open-file (stream filename 
                          :direction :input 
                          :if-does-not-exist :error)
    (let ((contents (make-string (file-length stream))))
      (read-sequence contents stream)
      contents)))

Please note: unfortunately, I am not allowed to use either loops or external libraries in this program.

1 answer:

Answer 0: (score: 8)

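This is an old problem, and the answer is "don't do that". The reason is that file-length cannot do what you want in many interesting cases. In particular, a version of file-length that works the way you expect, returning the number of characters in the file, is only easy to implement if one or both of the following are true: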

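  • the number of characters in the file is a fixed multiple of the number of bytes in it;
  • the operating system you are using keeps track of the number of characters in the file for you.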

Sadly, neither of these holds on any modern platform I know of:

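  • the number of characters in a file is not a fixed multiple of the number of bytes in it, for at least two reasons:

    • line-ending encodings mean a file may contain two characters at the end of a line (#\Return #\Newline) which are read as one;
    • the file may use an encoding which does not map bytes onto characters in any simple way, such as UTF-8, quite apart from line-ending sequences;
  • and the operating system only tells you the number of bytes in the file.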

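On such platforms, the only way file-length could tell you what you want to know for a file you are reading as a string is to read and decode the whole file, which is clearly undesirable. In practice, file-length just tells you the length of the file in bytes.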
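To see the mismatch on a particular file, something like the following sketch can help. The function name chars-vs-bytes is made up for illustration; it assumes the implementation's default external format and simply compares what file-length reports with what read-sequence actually filled.

```lisp
;; Hypothetical diagnostic helper (not part of the original answer).
;; Assumes the default external format of the implementation.
(defun chars-vs-bytes (path)
  "Return two values: what FILE-LENGTH reports for PATH opened as a
character stream, and the number of characters actually read."
  (with-open-file (in path :direction :input)
    (let* ((reported (file-length in))
           ;; Allocate as many characters as FILE-LENGTH claims.
           (buffer (make-string reported))
           ;; With no :START, the returned position equals the count
           ;; of characters read.
           (nread (read-sequence buffer in)))
      (values reported nread))))
```

When the second value is smaller than the first, the tail of the buffer was never filled, which is where the "extra characters" in the question come from.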

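So the trick of "work out the length of the file and slurp it in one big chunk" generally does not work, because the length of the file in characters cannot be known without reading it.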

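It is mildly annoying (and, I think, a minor deficiency of CL) that it does not include a function whose contract is "read this file and return a string containing it".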

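I think that, at least for common encodings, the length of a file in characters is never greater than its length in bytes. So, if you are willing to live slightly dangerously, one thing you can do is allocate an array as long as the file's byte length, read the file into it, and then note how much of the array you actually filled (for added cleverness, use an adjustable array and adjust it to the right length after reading).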
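A minimal sketch of that approach, with no loop and no external library, might look like this. The name load-file-trimmed is invented here, and the code relies on the assumption stated above that the file has no more characters than bytes; it is not safe if that assumption fails.

```lisp
;; Hypothetical variant of the question's LOAD-FILE (name invented here).
;; ASSUMPTION: the file has no more characters than bytes.
(defun load-file-trimmed (filename)
  (with-open-file (stream filename
                          :direction :input
                          :if-does-not-exist :error)
    (let* ((contents (make-string (file-length stream)))
           ;; READ-SEQUENCE returns the index of the first element
           ;; it did not fill, i.e. the number of characters read.
           (end (read-sequence contents stream)))
      ;; Copy only the filled part, dropping the unused tail.
      (subseq contents 0 end))))
```

This trades one extra copy (the subseq) for a plain simple string as the result; the fill-pointer variant avoids the copy but returns a non-simple array.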

Note that Alexandria contains a function, read-file-into-string, which does what you want, portably and probably quickly.

Here is a fairly naive version which I think will work in most cases (it does not think about the element type of the string at all):

(defun file->string (f &key (buffer-size 1024))
  (with-open-file (in f :direction :input)
    (with-output-to-string (out)
      (loop with buffer = (make-string buffer-size)
            for nchars = (read-sequence buffer in)
            do (write-sequence buffer out :start 0 :end nchars)
            while (= nchars buffer-size)))))

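Here is a partly tested, much hairier function which tries to be cleverer, and which deals with the case where the file is shorter in bytes than in characters (this can happen, for instance, if the file is being appended to while it is being read). The branch of the code that handles this has not been tested: caveat emptor.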

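In most cases this also copies less data, but the string it returns will generally have some wasted space. It assumes that fill pointers are cheap (they should be) and that resizing an array is acceptable only as a last resort: so when it needs to make the string shorter it does so by setting the fill pointer rather than by resizing, and it only resizes when the string needs to become longer.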
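As a small illustration of why the fill pointer is the cheap option, the sketch below (the function name shorten-in-place is invented for this example) makes a string look shorter without touching its storage:

```lisp
;; Hypothetical illustration; not part of the original answer.
(defun shorten-in-place (n-allocated n-used)
  "Allocate a string of N-ALLOCATED characters, then make it appear
N-USED characters long by moving the fill pointer.  Return the
apparent length and the underlying storage size as two values."
  (let ((buf (make-array n-allocated
                         :element-type 'character
                         :initial-element #\Space
                         :adjustable t
                         :fill-pointer t)))
    ;; Nothing is copied here: only the fill pointer moves.
    (setf (fill-pointer buf) n-used)
    (values (length buf) (array-dimension buf 0))))
```

Calling (shorten-in-place 8 3) returns 3 and 8: length respects the fill pointer, while array-dimension still reports the full allocation, which is the "wasted space" mentioned above.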

It also mildly assumes that tail calls are optimized.

(defun file->string (f &key (element-type ':default)
                       (external-format ':default)
                       (growth-factor 0.1))
  "Read a file into a string, dealing with character encoding issues"
  ;; This attempts to be efficient: it allocates a string which, if
  ;; there are slightly fewer characters than bytes in the file (which
  ;; is the case for common encodings), will be a little too large,
  ;; then reads the file into it in one fell swoop, setting the
  ;; fill-pointer correctly after doing so if needed.  It also
  ;; attempts to deal with the case where the file is *shorter* in
  ;; bytes than it is in characters (this might be true if the file
  ;; was being appended to as the read is happening, or on some
  ;; platform which compresses files and reports the compressed
  ;; length), although this part of the code is untested.
  ;;
  ;; I am not sure if the use of LISTEN here is really right.
  ;;
  (with-open-file (in f :direction :input
                      :element-type element-type
                      :external-format external-format)
    (let* ((l (file-length in))
           (buf (make-array (list l)
                            :element-type (stream-element-type in)
                            :adjustable t :fill-pointer t))
           (n (read-sequence buf in)))
      (cond ((< n l)
             ;; Just make the array seem a bit shorter: this is the
             ;; common case for things like UTF-8 and DOS line endings
             (adjust-array buf (list l) :fill-pointer n))
            ((and (= n l) (not (listen in)))
             ;; We got the exact length of the string and the stream
             ;; is at EOF.  So the string is fine as is: this will be
             ;; true for traditional Unix encodings where a character
             ;; is a byte and line endings are a single character.
             buf)
            (t
             ;; This is unexpected: the file is longer in characters
             ;; than it is in bytes.  This code is UNTESTED since the
             ;; only case I can engineer for it involves a race
             ;; between something which is appending to the file and
             ;; this code, and that test is too hard to set up.
             (labels ((get-more (start chunk-size)
                        (let ((size (+ start chunk-size)))
                          (adjust-array buf (list size) :fill-pointer size)
                          ;; READ-SEQUENCE returns a position, not a
                          ;; count, so subtract START to get the count.
                          (let ((n (- (read-sequence buf in :start start) start)))
                            (cond ((< n chunk-size)
                                   ;; we're done: set the fill pointer
                                   ;; right and return
                                   (adjust-array buf (list size)
                                                 :fill-pointer (+ start n)))
                                  ((and (= n chunk-size) (not (listen in)))
                                   ;; We're also done: we got the
                                   ;; exact number of characters we
                                   ;; had allocated fortuitously
                                   buf)
                                  (t
                                   ;; there is more to get
                                   (get-more (+ start chunk-size) chunk-size)))))))
               (get-more l (ceiling (* l growth-factor)))))))))