Question

I am using Ruby 1.9 to parse the following CSV file, which contains MacRoman characters:

# encoding: ISO-8859-1
#csv_parse.csv
Name, main-dialogue
"Marceu", "Give it to him ó he, his wife."

I parsed it as follows:

require 'csv'
input_string = File.read("../csv_parse.rb").force_encoding("ISO-8859-1").encode("UTF-8")
 #=> "Name, main-dialogue\r\n\"Marceu\", \"Give it to him  \x97 he, his wife.\"\r\n"

data = CSV.parse(input_string, :quote_char => "'", :col_sep => "/\",/")
 #=> [["Name, main-dialogue"], ["\"Marceu", " \"Give it to him  \x97 he, his wife.\""]]

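So the problem is that the second array in data comes back as a single string rather than 2 strings, like: ["\"Marceu\"", " \"Give it to him \x97 he, his wife.\""]] I tried :col_sep => "," (which is the default behaviour), but it gives me 3 splits.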

header = CSV.parse(input_string, :quote_char => "'")[0].map{|a| a.strip.downcase unless a.nil? }
 #=> ["Name", "main-dialogue"]

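I have to parse the header separately, because it has no double quotes.

The output is meant to be displayed in a browser again, so the character ó should show up as usual, not as \x97 or anything else.

Is there any way to solve the problem above?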

Answer 1

I think you actually have MacRoman-encoded data; if you run this in irb:

>> "\x97".force_encoding('MacRoman').encode('UTF-8')

you get:

=> "ó"

which seems to be the character you are expecting. So you want this:

input_string = File.read("../csv_parse.rb").force_encoding('MacRoman').encode('UTF-8')
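
To illustrate why the choice of source encoding matters, here is a minimal sketch that decodes the same byte both ways. The helper name to_utf8 is hypothetical and not part of the original answer; it only wraps the force_encoding/encode calls shown above.

```ruby
# Hypothetical helper (not from the original answer): reinterpret the raw
# bytes in the given source encoding, then transcode them to UTF-8.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

raw = "\x97"

latin1   = to_utf8(raw, 'ISO-8859-1')
macroman = to_utf8(raw, 'MacRoman')

# In ISO-8859-1, byte 0x97 maps to U+0097, a C1 control character.
puts format('ISO-8859-1: U+%04X', latin1.ord)
# In MacRoman, byte 0x97 maps to U+00F3, the letter "ó".
puts format('MacRoman:   U+%04X (%s)', macroman.ord, macroman)
```

Labelling the bytes as ISO-8859-1 does not raise an error, it just silently produces the wrong character, which is probably why the \x97 survived in the question's output.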

Then, your CSV has two columns that are quoted with double quotes (so you don't need :quote_char), and the delimiter is ', ', so this should work:

data = CSV.parse(input_string, :col_sep => ", ")

and data will look like this:

[
    ["Name", "main-dialogue"],
    ["Marceu", "Give it to him  ó he, his wife."]
]
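
Putting the two steps together, a self-contained sketch might look like the following. It uses an inline string as a stand-in for the file contents (an assumption, so that it runs without the file), and the helper name parse_macroman_csv is hypothetical.

```ruby
require 'csv'

# Hypothetical helper: decode MacRoman bytes to UTF-8, then parse them
# using ", " (comma plus space) as the column separator.
def parse_macroman_csv(raw_bytes)
  utf8 = raw_bytes.dup.force_encoding('MacRoman').encode('UTF-8')
  CSV.parse(utf8, :col_sep => ', ')
end

# Inline stand-in for the file contents; \x97 is the MacRoman byte for "ó".
SAMPLE = "Name, main-dialogue\r\n\"Marceu\", \"Give it to him  \x97 he, his wife.\"\r\n"

rows = parse_macroman_csv(SAMPLE)

# Normalise the header row the way the question does (strip and downcase).
header = rows.first.map { |cell| cell.strip.downcase }

p header
p rows.last
```

Because the second field is enclosed in double quotes, the ", " inside "he, his wife." is treated as data rather than as a separator, so the header no longer needs a separate parse.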

Answer 2

It seems to me that you are using the :quote_char and :col_sep options incorrectly.

The first should be the character used to enclose fields, i.e. '"' for the data you showed, and :col_sep should just be ","

The double quotes shown in your last example are just Ruby formatting the output.
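
As a rough sketch of those two points, the snippet below parses a line with '"' as the quote character and "," as the separator, then prints a field both ways. The sample line is hypothetical and has no space after the comma, unlike the file in the question, which is why Answer 1 uses ", " as the separator for that file. The helper name parse_standard_csv is also an assumption.

```ruby
require 'csv'

# Hypothetical helper: parse with the double quote as the enclosing
# character and a bare comma as the column separator.
def parse_standard_csv(text)
  CSV.parse(text, :quote_char => '"', :col_sep => ',')
end

# Hypothetical sample line with no space after the comma.
line = %q("Marceu","Give it to him, his wife.") + "\n"

rows = parse_standard_csv(line)

# p calls inspect, so each string is displayed wrapped in double quotes.
p rows
# puts prints the field value itself, with no surrounding quotes.
puts rows[0][1]
```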

Parsing csv with commas, double quotes and encoding
