Question

I am trying to read a .txt file in Ruby and split the text line by line.

Here is my code:

def file_read(filename)
  File.open(filename, 'r').read
end

puts f = file_read('alice_in_wonderland.txt')

This works perfectly. But when I add the method line_cutter like this:

def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end

puts f = line_cutter(file_read('alice_in_wonderland.txt'))

I get this error:

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I found this untrusted website online and tried to apply it to my own code, but it did not work. How can I get rid of this error?

Link to the file: File

Answer 1

The linked text file contains the following line:

Character set encoding: ISO-8859-1

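If converting the file is not desired or not possible, then you have to tell Ruby that it is ISO-8859-1 encoded. Otherwise Ruby uses the default external encoding (UTF-8 in your case), and bytes that are not valid UTF-8 make scan raise. One possible way to do that is: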

s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding  # => #<Encoding:ISO-8859-1>

Or, if you prefer your string to be UTF-8 encoded (see utf8everywhere.org), even like this:

s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding  # => #<Encoding:UTF-8>
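
To see the difference between the two reads without downloading the book, here is a minimal sketch on a made-up sample. The helper name read_latin1_as_utf8 and the temporary file are assumptions for illustration only and are not part of the original answer.

```ruby
require 'tempfile'

# Hypothetical helper (not from the answer): read a file as ISO-8859-1
# and return its contents transcoded to UTF-8.
def read_latin1_as_utf8(path)
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

# Made-up sample data: "café" in ISO-8859-1, where "é" is the single byte 0xE9.
sample = Tempfile.new('latin1_sample')
sample.binmode
sample.write("caf\xE9\n".b)
sample.close

# Read as UTF-8: the lone 0xE9 byte is not a valid UTF-8 sequence, so this
# is the kind of string on which scan failed in the question.
raw = File.read(sample.path, encoding: 'UTF-8')
puts raw.valid_encoding?

# Read with the correct source encoding: the string is valid UTF-8 and can be scanned.
text = read_latin1_as_utf8(sample.path)
puts text.valid_encoding?
puts text.scan(/\w/).length

sample.unlink
```

Checking valid_encoding? before calling scan is a cheap way to confirm whether the encoding you passed matches the bytes in the file.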

Answer 2

It seems to work if you read the file directly from the page, so maybe there is something funny about your local copy. Try this:

require 'net/http'

uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)
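
One plausible reason this version does not raise, offered here as an assumption rather than something the answer states: Net::HTTP usually returns the body tagged as binary, and every byte sequence is valid in the binary encoding. The sketch below uses a made-up byte string in place of a real response body and shows how such a string could be relabeled and converted to UTF-8.

```ruby
# Made-up stand-in for a response body: raw bytes tagged as binary.
body = "caf\xE9 au lait".b

# Any byte sequence is valid as binary, so scan has nothing to reject.
puts body.valid_encoding?
letters = body.scan(/\w/)

# Declare what the bytes really are (ISO-8859-1), then transcode to UTF-8.
utf8 = body.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts utf8.encoding
puts utf8.valid_encoding?
```

force_encoding only changes the label on the bytes, while encode rewrites the bytes themselves, which is why the two calls are chained in that order.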
